Standards and markup languages

As predictive models become more pervasive, the need to share models and to carry the modeling process through to completion has led to formalized development processes and interchangeable formats. In this section, we'll review two de facto standards: one covering data science processes and the other specifying an interchangeable format for sharing models between applications.

CRISP-DM

Cross Industry Standard Process for Data Mining (CRISP-DM) describes a data mining process commonly used by data scientists in industry. CRISP-DM breaks the data mining process into the following six major phases:

Business understanding
Data understanding
Data preparation
Modeling
Evaluation
Deployment

In the CRISP-DM process diagram, the arrows indicate the process flow, which can move back and forth between the phases. The process doesn't stop with model deployment: the outer arrow indicates the cyclic nature of data science. Lessons learned during the process can trigger new questions and repeat the process while improving on previous results.
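The back-and-forth flow can be written down as a small transition table. The following is a minimal sketch, assuming the arrows as they are commonly drawn in the CRISP-DM reference diagram; the names `PHASES`, `TRANSITIONS`, and `can_move` are hypothetical and introduced only for illustration:

```python
# The six CRISP-DM phases, in the order listed above.
PHASES = [
    "Business understanding",
    "Data understanding",
    "Data preparation",
    "Modeling",
    "Evaluation",
    "Deployment",
]

# Allowed moves between phases. This encoding is an assumption based on the
# commonly reproduced CRISP-DM diagram, not an official API or schema.
TRANSITIONS = {
    "Business understanding": ["Data understanding"],
    "Data understanding": ["Business understanding", "Data preparation"],
    "Data preparation": ["Modeling"],
    "Modeling": ["Data preparation", "Evaluation"],
    "Evaluation": ["Business understanding", "Deployment"],
    # The outer arrow: deployment feeds a new iteration of the whole cycle.
    "Deployment": ["Business understanding"],
}


def can_move(src, dst):
    """Return True if the process may go directly from phase src to phase dst."""
    return dst in TRANSITIONS[src]


print(can_move("Modeling", "Data preparation"))    # stepping back is allowed
print(can_move("Data preparation", "Deployment"))  # skipping evaluation is not
```

Encoding the flow as data rather than as control logic keeps the sketch easy to adjust if a team follows a variant of the process.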

SEMMA methodology

Another methodology is Sample, Explore, Modify, Model, and Assess (SEMMA). SEMMA describes the main modeling tasks in data science, while leaving aside aspects such as business understanding and deployment. SEMMA was developed by SAS Institute, one of the largest vendors of statistical software, with the aim of helping users of its software carry out the core tasks of data mining.
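To make the five SEMMA steps concrete, here is a minimal sketch that walks through them on synthetic data using only the Python standard library. The data, the 70/30 split, and the choice of a one-variable least-squares model are all assumptions made for illustration; they are not part of SEMMA itself:

```python
import random
import statistics

random.seed(0)

# Hypothetical toy data: y is roughly 3 * x + 2 plus Gaussian noise.
population = [(float(x), 3.0 * x + 2.0 + random.gauss(0, 0.5)) for x in range(200)]

# Sample: draw a subset and split it into training and held-out parts.
sample = random.sample(population, 100)
train, test = sample[:70], sample[70:]

# Explore: basic summary statistics of the training inputs and targets.
xs = [x for x, _ in train]
ys = [y for _, y in train]
x_mean, x_std = statistics.mean(xs), statistics.stdev(xs)
print(f"x: mean={x_mean:.2f} std={x_std:.2f}; y: mean={statistics.mean(ys):.2f}")


# Modify: standardize the input using the training statistics.
def modify(x):
    return (x - x_mean) / x_std


# Model: ordinary least squares for a single standardized feature.
zs = [modify(x) for x in xs]
z_mean, y_mean = statistics.mean(zs), statistics.mean(ys)
slope = sum((z - z_mean) * (y - y_mean) for z, y in zip(zs, ys)) / sum(
    (z - z_mean) ** 2 for z in zs
)
intercept = y_mean - slope * z_mean


def predict(x):
    return intercept + slope * modify(x)


# Assess: mean squared error on the held-out part of the sample.
mse = statistics.mean([(predict(x) - y) ** 2 for x, y in test])
print(f"held-out MSE: {mse:.3f}")
```

Note that nothing in the sketch asks why the model is needed or how it will be served, which mirrors SEMMA's focus on the modeling tasks alone.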

Predictive Model Markup Language

Predictive Model Markup Language (PMML) is an XML-based interchange format that allows machine learning models to be easily shared between applications and systems. Supported models include logistic regression, neural networks, decision trees, naïve Bayes, regression models, and many others. A typical PMML file consists of the following sections:

Header containing general information
Data dictionary, describing data types
Data transformations, specifying steps for normalization, discretization, aggregations, or custom functions
Model definition, including parameters
Mining schema listing attributes used by the model
Targets allowing post-processing of the predicted results
Output listing fields to be outputted and other post-processing steps
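The sections listed above map onto XML elements. The following sketch embeds a hand-written PMML document for a hypothetical linear model (y = 2 + 3x) and uses Python's standard `xml.etree.ElementTree` to list its parts. The model, field names, and the PMML 4.4 namespace are assumptions chosen for illustration, and the document has not been validated against the official schema:

```python
import xml.etree.ElementTree as ET

# Hypothetical, hand-written PMML document (assumes the PMML 4.4 namespace).
PMML_DOC = """<?xml version="1.0" encoding="UTF-8"?>
<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header description="Hypothetical linear model: y = 2 + 3 * x"/>
  <DataDictionary numberOfFields="2">
    <DataField name="x" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <TransformationDictionary/>
  <RegressionModel modelName="toy" functionName="regression">
    <MiningSchema>
      <MiningField name="x"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <Output>
      <OutputField name="predicted_y" optype="continuous" dataType="double"
                   feature="predictedValue"/>
    </Output>
    <Targets>
      <Target field="y" rescaleConstant="0" rescaleFactor="1"/>
    </Targets>
    <RegressionTable intercept="2.0">
      <NumericPredictor name="x" coefficient="3.0"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
"""


def local_name(element):
    """Strip the '{namespace}' prefix that ElementTree puts on tag names."""
    return element.tag.split("}")[1]


# Parse from bytes so the encoding declaration is honored by the parser.
root = ET.fromstring(PMML_DOC.encode("utf-8"))

top_level = [local_name(child) for child in root]
model = root[3]  # the RegressionModel element
model_parts = [local_name(child) for child in model]

print(top_level)
print(model_parts)
```

In this layout the mining schema, output, and targets sit inside the model element, while the header, data dictionary, and transformation dictionary are shared at the top level of the file.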

The generated PMML files can be imported into any PMML-consuming application, such as the Zementis Adaptive Decision and Predictive Analytics (ADAPA) and Universal PMML Plug-in (UPPI) scoring engines. Several other tools support the format as well: Weka has built-in support for regression, general regression, neural network, TreeModel, RuleSetModel, and Support Vector Machine (SVM) models; Spark can export k-means clustering, linear regression, ridge regression, lasso, binary logistic regression, and SVM models; and Cascading can transform PMML files into an application on Apache Hadoop.
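To illustrate what a PMML-consuming application does, here is a toy scorer that reads the regression table from a PMML string and applies it to one record. It is a sketch only, not how ADAPA, Weka, or any other real engine is implemented: it handles a single `RegressionTable` with `NumericPredictor` terms of exponent 1 and ignores transformations, targets, and outputs. The model and the helper names `load_linear_model` and `score` are hypothetical:

```python
import xml.etree.ElementTree as ET

# Namespace mapping for XPath-style lookups (assumes PMML 4.4).
NS = {"pmml": "http://www.dmg.org/PMML-4_4"}

# Hypothetical exported model: y = 1.5 + 2.0 * x1 - 0.5 * x2
MODEL_XML = """<PMML xmlns="http://www.dmg.org/PMML-4_4" version="4.4">
  <Header/>
  <DataDictionary numberOfFields="3">
    <DataField name="x1" optype="continuous" dataType="double"/>
    <DataField name="x2" optype="continuous" dataType="double"/>
    <DataField name="y" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel functionName="regression">
    <MiningSchema>
      <MiningField name="x1"/>
      <MiningField name="x2"/>
      <MiningField name="y" usageType="target"/>
    </MiningSchema>
    <RegressionTable intercept="1.5">
      <NumericPredictor name="x1" coefficient="2.0"/>
      <NumericPredictor name="x2" coefficient="-0.5"/>
    </RegressionTable>
  </RegressionModel>
</PMML>"""


def load_linear_model(xml_text):
    """Extract the intercept and per-field coefficients from the PMML text."""
    root = ET.fromstring(xml_text)
    table = root.find("pmml:RegressionModel/pmml:RegressionTable", NS)
    intercept = float(table.get("intercept"))
    coefficients = {
        predictor.get("name"): float(predictor.get("coefficient"))
        for predictor in table.findall("pmml:NumericPredictor", NS)
    }
    return intercept, coefficients


def score(model, record):
    """Apply the linear model to a record given as a field-name -> value dict."""
    intercept, coefficients = model
    return intercept + sum(c * record[name] for name, c in coefficients.items())


model = load_linear_model(MODEL_XML)
prediction = score(model, {"x1": 4.0, "x2": 2.0})
print(prediction)  # 1.5 + 2.0 * 4.0 - 0.5 * 2.0 = 8.5
```

Because the scorer depends only on the XML, the same file could be produced by one tool and scored by another, which is the point of an interchange format.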

The next generation of PMML is an emerging format called Portable Format for Analytics (PFA), which provides a common interface for deploying complete workflows across environments.
