PMML

The Predictive Model Markup Language (PMML) is an XML-based language aimed at providing a way to exchange different predictive models, for classification or regression purposes, generated using a data mining or machine learning technique. PMML was originally developed by the Data Mining Group (http://www.dmg.org/) in 1997 and its latest version (4.2.1) dates from May 2014.

Although PMML itself is not a business-oriented language, PMML documents can currently be generated from a variety of well-known applications, such as Knime (https://www.knime.org/) or the R language (https://www.r-project.org/).

PMML support in Drools is relatively new. It originally started as an experimental module but, as of version 6.1, PMML is a first-class citizen of the Drools ecosystem. The PMML standard can be used to encode and interchange classification and regression models, such as neural networks, decision trees, scorecards, and others. By adopting PMML, Drools has gained access to a broader set of options for knowledge representation.

Unfortunately, not all of the models supported in PMML are supported in Drools. The list of supported models is growing; the following are currently supported:

  • Clustering
  • Simple Regression
  • Naïve Bayes
  • Neural Networks
  • Scorecards
  • Decision Trees
  • Support Vector Machines

An explanation of each of these model types is outside the scope of this section. More information about PMML and the models it supports can be found on the Data Mining Group website (http://www.dmg.org/v4-2-1/GeneralStructure.html).

PMML in Drools

A PMML document is an XML document composed of up to four main sections: Header, Data Dictionary, Data Transformation, and Model.

The Header section contains meta-information about the document itself. Elements such as information about the model being used, the application that was used to generate it, and a creation timestamp can be found in this section.

The Data Dictionary section defines all the possible data fields the document may use. For each data field, a name and its type are specified.

A Data Transformation section can be specified in the document to define any mapping between the incoming data and the data required by the model.

The concrete type of model is specified in the Model section of the document. The content of this section depends on the model being used (for example, neural network, scorecard, and so on). The Model section may contain the following sub-sections:

  • Mining Schema: this defines the subset of the fields in the Data Dictionary section used in the model, along with some metadata such as their usage type
  • Output: this defines the model's output fields
  • Targets: this allows post-processing of the output fields of the model.

Just like with any other resource type in Drools, a PMML asset can be compiled in two different ways: declaratively by using a kmodule.xml file or programmatically by using KieHelper or KieBuilder instances.

If we want to include PMML resources in our knowledge bases, our project must declare a dependency on the org.drools:drools-pmml Maven artifact (or include the drools-pmml JAR in the classpath).

The source bundle associated with this chapter contains a simple PMML example (org.drools.devguide.chapter07.PMMLTest) that programmatically creates a KIE Base from a PMML resource.

When a PMML document is compiled by Drools, its components are analyzed and the appropriate combination of rules, declared types, and entry points is created to emulate the calculations performed by the model being processed. The final result will be a set of Drools assets that will mimic the behavior of the original model.

The generated entry points are one possible way to evaluate a predictive model on some input data, facilitating the binding of external data streams (see later for alternative evaluation techniques). The declared types (implementing the base interface org.drools.pmml.pmml_4_2.PMML4Field) hold the current input, internal and output values, together with their metadata (for instance, probabilities and missing/invalid flags); the production rules generate the output values based on the model's evaluation semantics.

Evaluating a PMML model in Drools is slower than evaluating it in a native engine that compiles the model into a sequence of mathematical operations. Drools' goal at this point is not performance, but rather to provide a uniform abstraction of a hybrid KIE Base containing both rule-based and non-rule-based reasoning elements.

Tip

In a future version, Drools will support both compiled and "explicit" (rule-based) models and the ability to switch between the alternative implementations. An important feature of a rule-based model is that the model's parameters are always asserted as facts in working memory and can be modified by other rules implementing adaptive, online training strategies.

To set the input fields of a model in a KIE Session, we use the entry points generated by the PMML compiler. Each Mining Field in the PMML model creates an entry point named in_<FIELD_NAME>, where <FIELD_NAME> is the name of the field. In the example later in this section, the first letter of the field name appears capitalized in the entry point name (for instance, in_PreviousCategory for the previousCategory field).

There are three other ways to specify the inputs of a model:

  • Instantiating the input types directly, as declared types. We covered how to instantiate declared types in Chapter 4, Improving Our Rule Syntax. Now that we know that each input field in the model will generate a declared type, we can instantiate them and insert them into the corresponding session.
  • Binding a Java bean to the model, which contains a field for each entry in the data dictionary.
  • Enabling the declaration of a trait that mimics the data dictionary. Each input field in the model will generate a corresponding Trait class definition with the name <FIELD_NAME>Trait. We can then don an object containing a field for each entry in the data dictionary to feed the model.

After inserting the corresponding fields into a KIE Session and calling fireAllRules(), the corresponding output fields will be generated inside the session. These output fields are also modeled as declared types. We can then use some of the techniques introduced in Chapter 5, Understanding KIE Sessions, to extract these values from the session.

Note

PMML models are stateless. Any change in the input values will be reflected by a change in the outputs. Parallel or persistent evaluation is currently not supported.

To get a better understanding of how a PMML document is compiled and used, let us implement a very simple model for our customer classification scenario.

Customer classification decision tree example

One of the supported PMML models in Drools is the Decision Tree. A decision tree allows the creation of a tree-like graph, where each node represents a condition that, when evaluated, determines whether the branches under it should also be evaluated. We can refer to Data Mining Group's website for more information about decision trees (http://www.dmg.org/v4-2-1/TreeModel.html).

For our simple classification scenario, where the category of a customer is dictated by their current category and age, the following decision tree would satisfy our requirements:

[Figure: Customer classification decision tree example]

The preceding figure shows the decision tree for our example. We start with a Customer object, and the first attribute we evaluate is its category. If the category is already set (that is, it is not NA), the tree generates a special result indicating that the current category should not be modified. If the category is NA, the next attribute to be evaluated is the age. For the age, we have our four well-known segments, each of which generates a different category to be assigned to the customer. A PMML version of this decision tree can be found in the source bundle associated with this chapter (customer-classification-simple.pmml.xml). Let's now analyze the different sections of this PMML file.
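To make the evaluation order concrete, the tree just described can be sketched in plain Java. This is only an illustration of the logic, not Drools or PMML code: the class and method names are hypothetical, and the age cut-offs (21, 30, and 40) are assumptions chosen for the sketch, since the actual segments live in the figure and in the PMML file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiPredicate;

/** Plain-Java sketch of the customer classification tree. Not Drools or PMML code. */
final class CustomerTreeSketch {

    /** A node holds a condition, an optional score, and children tried only if the condition holds. */
    static final class Node {
        private final BiPredicate<String, Integer> condition;
        private final String score;
        private final List<Node> children = new ArrayList<>();

        Node(BiPredicate<String, Integer> condition, String score) {
            this.condition = condition;
            this.score = score;
        }

        Node add(Node child) {
            children.add(child);
            return this;
        }

        /** Returns null if this node's condition fails; otherwise the first child result, or this node's score. */
        String evaluate(String previousCategory, int age) {
            if (!condition.test(previousCategory, age)) {
                return null; // condition failed: the branches under this node are skipped
            }
            for (Node child : children) {
                String result = child.evaluate(previousCategory, age);
                if (result != null) {
                    return result;
                }
            }
            return score; // leaf node, or no child matched
        }
    }

    /** Builds the tree. The age cut-offs below are ASSUMPTIONS made for this sketch. */
    static Node buildTree() {
        Node ageBranch = new Node((category, age) -> "NA".equals(category), null)
                .add(new Node((category, age) -> age <= 21, "NA"))
                .add(new Node((category, age) -> age <= 30, "BRONZE"))
                .add(new Node((category, age) -> age <= 40, "SILVER"))
                .add(new Node((category, age) -> true, "GOLD"));
        return new Node((category, age) -> true, null)
                .add(new Node((category, age) -> !"NA".equals(category), "NO_CHANGE"))
                .add(ageBranch);
    }

    public static void main(String[] args) {
        Node tree = buildTree();
        System.out.println(tree.evaluate("NA", 34));   // category not set: age decides
        System.out.println(tree.evaluate("GOLD", 34)); // category already set: no change
    }
}
```

With these assumed cut-offs, a customer with category NA and age 34 falls into the SILVER segment, while a customer who already has a category yields NO_CHANGE regardless of age.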

Header

The first section is the Header:

<Header description="A simple decision tree model for customer categorization."/>

In this case, this section only defines a description for the document.

DataDictionary

This section defines three fields: previousCategory, age, and result:

<DataDictionary numberOfFields="3">
    <DataField name="previousCategory" optype="categorical" dataType="string">
        <Value value="NA"/>
        <Value value="BRONZE"/>
        <Value value="SILVER"/>
        <Value value="GOLD"/>
    </DataField>
    <DataField name="age" optype="continuous" dataType="integer"/>
    <DataField name="result" optype="categorical" dataType="string">
        <Value value="NO_CHANGE"/>
        <Value value="NA"/>
        <Value value="BRONZE"/>
        <Value value="SILVER"/>
        <Value value="GOLD"/>
    </DataField>
</DataDictionary>

PMML is not object-oriented; this is why we need to decompose the attributes of our customer into simple elements.

Each of these fields will create a declared type and an entry point when compiled.
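As a rough illustration of what the compiler has to work with, the following sketch reads the DataDictionary above with the JDK's DOM parser and lists each field next to a guessed entry point name. It does not use any Drools API. The class and method names are hypothetical, and the naming rule ("in_" plus the field name with its first letter capitalized) is an assumption inferred from the entry point names used in the Java snippet later in this section.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** JDK-only sketch that lists the fields of a PMML DataDictionary. Not part of Drools. */
final class DataDictionarySketch {

    /** The DataDictionary from this section, with the Value elements left out for brevity. */
    static final String DATA_DICTIONARY =
            "<DataDictionary numberOfFields=\"3\">"
          + "<DataField name=\"previousCategory\" optype=\"categorical\" dataType=\"string\"/>"
          + "<DataField name=\"age\" optype=\"continuous\" dataType=\"integer\"/>"
          + "<DataField name=\"result\" optype=\"categorical\" dataType=\"string\"/>"
          + "</DataDictionary>";

    /** ASSUMED naming rule: "in_" followed by the field name with its first letter capitalized. */
    static String entryPointName(String fieldName) {
        return "in_" + Character.toUpperCase(fieldName.charAt(0)) + fieldName.substring(1);
    }

    /** Returns one line per DataField: name, optype, dataType, and the guessed entry point name. */
    static List<String> describeFields(String xml) {
        try {
            Element root = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
            NodeList fields = root.getElementsByTagName("DataField");
            List<String> lines = new ArrayList<>();
            for (int i = 0; i < fields.getLength(); i++) {
                Element field = (Element) fields.item(i);
                String name = field.getAttribute("name");
                lines.add(name + " (" + field.getAttribute("optype") + ", "
                        + field.getAttribute("dataType") + ") -> " + entryPointName(name));
            }
            return lines;
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the data dictionary", e);
        }
    }

    public static void main(String[] args) {
        describeFields(DATA_DICTIONARY).forEach(System.out::println);
    }
}
```

A listing like this can be a handy cross-check when an entry point lookup in a session unexpectedly returns nothing, because a mistyped field name is easy to spot.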

Model

In this particular example, the model is defined by a <TreeModel> element. This section is composed of a series of <Node> elements defining the structure of the tree.

This section also defines two important sub-sections: mining schema and output.

The mining schema identifies the set of the fields from the data dictionary used in this model:

<MiningSchema>
    <MiningField name="previousCategory"/>
    <MiningField name="age"/>
    <MiningField name="result" usageType="predicted"/>
</MiningSchema>

In this case, the field result is marked as "predicted". This will tell the model that this field needs to be calculated when the model is executed.

The output section defines the output of the model:

<Output>
    <OutputField name="newCategory" targetField="result"/>
</Output>

This output field, mapped to the result mining field, will also generate a declared type in Drools when the model is compiled. Instances of this declared type will contain the result of the execution.
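The relationship between the mining schema, the output section, and the generated declared type can be traced with a small JDK-only sketch. It wraps the two fragments above in a bare <TreeModel> element purely so that they parse as one document; the real element carries more attributes and content. The class and method names are hypothetical, and the rule that the declared type name is the output field name with its first letter capitalized is an assumption inferred from the NewCategory type used in the query below.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

/** JDK-only sketch relating MiningSchema, Output, and declared type names. Not part of Drools. */
final class OutputMappingSketch {

    /** The two fragments from this section, wrapped in a bare TreeModel element for parsing. */
    static final String MODEL_FRAGMENT =
            "<TreeModel>"
          + "<MiningSchema>"
          + "<MiningField name=\"previousCategory\"/>"
          + "<MiningField name=\"age\"/>"
          + "<MiningField name=\"result\" usageType=\"predicted\"/>"
          + "</MiningSchema>"
          + "<Output><OutputField name=\"newCategory\" targetField=\"result\"/></Output>"
          + "</TreeModel>";

    static Element parse(String xml) {
        try {
            return DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)))
                    .getDocumentElement();
        } catch (Exception e) {
            throw new IllegalArgumentException("Could not parse the model fragment", e);
        }
    }

    /** Names of the mining fields whose usageType is "predicted". */
    static List<String> predictedFields(Element model) {
        List<String> names = new ArrayList<>();
        NodeList fields = model.getElementsByTagName("MiningField");
        for (int i = 0; i < fields.getLength(); i++) {
            Element field = (Element) fields.item(i);
            if ("predicted".equals(field.getAttribute("usageType"))) {
                names.add(field.getAttribute("name"));
            }
        }
        return names;
    }

    /** Maps each output field name to the mining field named in its targetField attribute. */
    static Map<String, String> outputTargets(Element model) {
        Map<String, String> targets = new LinkedHashMap<>();
        NodeList outputs = model.getElementsByTagName("OutputField");
        for (int i = 0; i < outputs.getLength(); i++) {
            Element output = (Element) outputs.item(i);
            targets.put(output.getAttribute("name"), output.getAttribute("targetField"));
        }
        return targets;
    }

    /** ASSUMED naming rule: the output field name with its first letter capitalized. */
    static String declaredTypeName(String outputField) {
        return Character.toUpperCase(outputField.charAt(0)) + outputField.substring(1);
    }

    public static void main(String[] args) {
        Element model = parse(MODEL_FRAGMENT);
        System.out.println("Predicted fields: " + predictedFields(model));
        outputTargets(model).forEach((output, target) ->
                System.out.println(output + " -> " + target + " (type " + declaredTypeName(output) + ")"));
    }
}
```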

The test included in the source bundle (org.drools.devguide.chapter07.PMMLTest class) uses a separate DRL resource that defines a query to extract the results generated by this model:

query getNewCategory()
    NewCategory($cat: value, valid == true)
end

The query basically filters all the objects of type NewCategory (this is the declared type generated by Drools) that are in a valid state and returns their value.

Once we have a session containing the compiled version of this PMML file, we can use the following code snippet to provide the value of the input fields of the model:

KieSession ksession = ...; // obtain a KIE Session reference
// Each input field has its own entry point; insert one value per field.
ksession.getEntryPoint("in_PreviousCategory").insert("NA");
ksession.getEntryPoint("in_Age").insert(34);
ksession.fireAllRules();
// Execute the 'getNewCategory' query to get the result.

As we have previously mentioned, each input field generates its own entry point that can be used to set its value. The preceding example assigns the value "NA" to the previousCategory field and the value 34 to age. According to our decision tree, the expected result is SILVER.

PMML troubleshooting

Just like with DSL, decision tables, and rule templates, PMML adds an abstraction level over the DRL that actually gets executed. When something goes wrong, the cause of the error is not always easy to identify.

The good news is that, unlike DSL, decision tables, and rule templates, PMML is an XML-based language with a well-defined schema for its structure and values (http://www.dmg.org/v4-2-1/pmml-4-2.xsd). Validating our documents against this schema lets us catch errors caused by malformed or invalid XML before the document is ever compiled.
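Schema validation of this kind needs nothing beyond the JDK's javax.xml.validation package. The sketch below shows the mechanics under one loudly stated assumption: to stay self-contained, it validates against a tiny stand-in schema (MINI_XSD, invented for this example) that only knows a Header element with a required description attribute. It is not the PMML schema; in practice, the schema source would be the PMML 4.2 XSD mentioned above. The class and method names are hypothetical.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.xml.XMLConstants;
import javax.xml.transform.stream.StreamSource;
import javax.xml.validation.Schema;
import javax.xml.validation.SchemaFactory;
import javax.xml.validation.Validator;
import org.xml.sax.SAXException;

/** JDK-only sketch of XSD validation. The schema used here is a stand-in, NOT the PMML schema. */
final class SchemaValidationSketch {

    /** Invented mini schema for the demo: a Header element with a required description attribute. */
    static final String MINI_XSD =
            "<xs:schema xmlns:xs=\"http://www.w3.org/2001/XMLSchema\">"
          + "<xs:element name=\"Header\">"
          + "<xs:complexType>"
          + "<xs:attribute name=\"description\" type=\"xs:string\" use=\"required\"/>"
          + "</xs:complexType>"
          + "</xs:element>"
          + "</xs:schema>";

    static final String VALID_HEADER =
            "<Header description=\"A simple decision tree model for customer categorization.\"/>";

    /** Returns null when the document is valid; otherwise the message of the first error found. */
    static String validate(String xml, String xsd) {
        try {
            SchemaFactory factory = SchemaFactory.newInstance(XMLConstants.W3C_XML_SCHEMA_NS_URI);
            Schema schema = factory.newSchema(new StreamSource(new StringReader(xsd)));
            Validator validator = schema.newValidator();
            // Without a custom ErrorHandler, the first validation error is thrown as an exception.
            validator.validate(new StreamSource(new StringReader(xml)));
            return null;
        } catch (SAXException e) {
            return e.getMessage(); // malformed XML or a schema violation
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Valid header error: " + validate(VALID_HEADER, MINI_XSD));
        System.out.println("Missing attribute error: " + validate("<Header/>", MINI_XSD));
        System.out.println("Malformed XML error: " + validate("<Header", MINI_XSD));
    }
}
```

Running a check like this before handing the document to the KIE Builder separates plain XML problems from problems in the DRL that the PMML compiler generates.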

If we still need to know what's going on under the hood when a PMML document is compiled, there is a way to dump the DRL that the compilation process generates. The unit test associated with this section contains a method called printGeneratedDRL() that does exactly that. This method uses the org.drools.pmml.pmml_4_2.PMML4Compiler class to convert a PMML document into DRL. This is the same class used internally by the KIE Builder when compiling a PMML resource.

PMML limitations

We already mentioned that PMML support in Drools is relatively new, and there are some consequences to this. The first obvious consequence is that not all of the models supported by PMML are currently supported by Drools. This gap should shrink with every new version of the drools-pmml module. Another consequence is that this module has not yet been exposed to a considerable audience. While using this module, expect some rough edges, unsupported features, and even bugs.

When modeling our scenarios using PMML, we have to bear in mind that predictive models are quantitative and process primitive values, either continuous or categorical. Coming from an object-oriented environment such as Java can present some challenges when designing a solution involving PMML documents. As hinted in the model binding discussion, one or more objects can be used to deliver and collect the values of the features used by the model, but the predictive models themselves will not be aware of the domain-specific nature of those values. This is inherent to the nature of predictive models, and developers should not make additional assumptions.

Probably one of the biggest limitations of the current implementation of drools-pmml is the fact that a session can only be used to execute a single instance of a model. If we want to evaluate two customers in our sample tree, we will need to either create two independent KIE Sessions or sequentially evaluate each customer in a single KIE session. The problem with having simultaneous evaluations is that there is no way to identify which result instance corresponds to which set of input parameters.
