Weka

Waikato Environment for Knowledge Analysis (WEKA) is a machine learning library that was developed at the University of Waikato, New Zealand, and is probably the most well-known Java machine learning library. It is a general-purpose library that can handle a wide variety of machine learning tasks, such as classification, regression, and clustering. It features a rich graphical user interface, a command-line interface, and a Java API. You can check out Weka at http://www.cs.waikato.ac.nz/ml/weka/.

At the time of writing this book, Weka contains 267 algorithms in total: data preprocessing (82), attribute selection (33), classification and regression (133), clustering (12), and association rules mining (7). Graphical interfaces are well suited for exploring your data, while the Java API allows you to develop new machine learning schemes and use the algorithms in your applications.

Weka is distributed under the GNU General Public License (GNU GPL), which means that you can copy, distribute, and modify it as long as you track changes in source files and keep it under GNU GPL. You can even distribute it commercially, but you must disclose the source code or obtain a commercial license.

In addition to several supported file formats, Weka features its own default data format, ARFF, which describes data as attribute-value pairs. An ARFF file consists of two parts. The first part is a header, which specifies all of the attributes and their types, for instance, nominal, numeric, date, and string. The second part contains the data, where each line corresponds to an instance. By convention, the last attribute in the header is treated as the target variable (Weka's graphical tools default to it, while the Java API expects you to set the class attribute explicitly), and missing data is marked with a question mark. For example, returning to the example from Chapter 1, Applied Machine Learning Quick Start, the Bob instance, together with two other instances, would be written in the ARFF file format as follows:

@RELATION person_dataset

@ATTRIBUTE 'Name' STRING
@ATTRIBUTE 'Height' NUMERIC
@ATTRIBUTE 'Eye color' {blue, brown, green}
@ATTRIBUTE 'Hobbies' STRING

@DATA
'Bob', 185.0, blue, 'climbing, sky diving'
'Anna', 163.0, brown, 'reading'
'Jane', 168.0, ?, ?

Read from top to bottom, the file has three sections, because the header splits into the relation name and the attribute declarations. The first section starts with the @RELATION <String> keyword, specifying the dataset name. The next section consists of @ATTRIBUTE lines, each starting with the keyword, followed by the attribute name and type. The available types are STRING, NUMERIC, DATE, and a set of categorical values listed in curly braces. The last attribute is assumed by convention to be the target variable that we want to predict. The last section starts with the @DATA keyword, followed by one instance per line. Instance values are separated by commas and must follow the same order as the attributes in the second section.
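To make the three-section layout concrete, here is a minimal sketch of a reader written in plain Java. It is not Weka's own ARFF loader: the ArffSketch class and its methods are hypothetical names introduced only for illustration, and the sketch assumes single-quoted values, one instance per line, and no sparse or escaped data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; this is NOT part of the Weka API.
class ArffSketch {
    String relation;                                              // from @RELATION
    final List<String> attributeNames = new ArrayList<>();        // from @ATTRIBUTE
    final List<String> attributeTypes = new ArrayList<>();
    final List<List<String>> rows = new ArrayList<>();            // lines after @DATA

    static ArffSketch parse(String text) {
        ArffSketch arff = new ArffSketch();
        boolean inData = false;
        for (String raw : text.split("\\R")) {
            String line = raw.trim();
            if (line.isEmpty() || line.startsWith("%")) continue; // blanks and comments
            String upper = line.toUpperCase();
            if (inData) {
                arff.rows.add(splitValues(line));
            } else if (upper.startsWith("@RELATION")) {
                arff.relation = unquote(line.substring("@RELATION".length()).trim());
            } else if (upper.startsWith("@ATTRIBUTE")) {
                String rest = line.substring("@ATTRIBUTE".length()).trim();
                int end = nameEnd(rest);
                arff.attributeNames.add(unquote(rest.substring(0, end)));
                arff.attributeTypes.add(rest.substring(end).trim());
            } else if (upper.startsWith("@DATA")) {
                inData = true;
            }
        }
        return arff;
    }

    // Index just past the attribute name, which may be quoted if it contains spaces.
    private static int nameEnd(String rest) {
        if (rest.isEmpty()) return 0;
        char first = rest.charAt(0);
        if (first == '\'' || first == '"') {
            int close = rest.indexOf(first, 1);
            return close < 0 ? rest.length() : close + 1;
        }
        for (int i = 0; i < rest.length(); i++) {
            char c = rest.charAt(i);
            if (Character.isWhitespace(c) || c == '{') return i;
        }
        return rest.length();
    }

    // Splits on commas that are outside single quotes and drops the quotes.
    private static List<String> splitValues(String line) {
        List<String> values = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean quoted = false;
        for (char c : line.toCharArray()) {
            if (c == '\'') {
                quoted = !quoted;
            } else if (c == ',' && !quoted) {
                values.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        values.add(current.toString().trim());
        return values;
    }

    private static String unquote(String s) {
        if (s.length() >= 2 && (s.charAt(0) == '\'' || s.charAt(0) == '"')
                && s.charAt(s.length() - 1) == s.charAt(0)) {
            return s.substring(1, s.length() - 1);
        }
        return s;
    }

    public static void main(String[] args) {
        String text = "@RELATION person_dataset\n"
                + "@ATTRIBUTE 'Name' STRING\n"
                + "@ATTRIBUTE 'Height' NUMERIC\n"
                + "@ATTRIBUTE 'Eye color' {blue, brown, green}\n"
                + "@ATTRIBUTE 'Hobbies' STRING\n"
                + "@DATA\n"
                + "'Bob', 185.0, blue, 'climbing, sky diving'\n"
                + "'Anna', 163.0, brown, 'reading'\n"
                + "'Jane', 168.0, ?, ?\n";
        ArffSketch arff = parse(text);
        System.out.println(arff.relation + " " + arff.attributeNames);
        for (List<String> row : arff.rows) {
            System.out.println(row); // missing values stay as "?"
        }
    }
}
```

Note that the comma inside 'climbing, sky diving' does not split the value, because it sits inside quotes, and that missing values are kept as the literal question mark. In a real project, you would normally let Weka load the file rather than parse it by hand.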

More Weka examples will be demonstrated in Chapter 3, Basic Algorithms – Classification, Regression, and Clustering, and Chapter 4, Customer Relationship Prediction with Ensembles.

To learn more about Weka, pick up a quick-start book, Weka How-to by Kaluza (Packt Publishing), to start coding, or look into Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations by Witten and Frank (Morgan Kaufmann Publishers) for theoretical background and in-depth explanations.

Weka's Java API is organized into the following top-level packages:

- weka.associations: These are data structures and algorithms for association rule learning, including Apriori, predictive Apriori, FilteredAssociator, FP-Growth, Generalized Sequential Patterns (GSP), HotSpot, and Tertius.
- weka.classifiers: These are supervised learning algorithms, evaluators, and data structures. The package is further split into the following components:
  - weka.classifiers.bayes: This implements Bayesian methods, including Naive Bayes, Bayes net, Bayesian logistic regression, and so on.
  - weka.classifiers.evaluation: These are supervised evaluation algorithms for nominal and numerical prediction, such as evaluation statistics, confusion matrix, ROC curve, and so on.
  - weka.classifiers.functions: These are regression algorithms, including linear regression, isotonic regression, Gaussian processes, Support Vector Machines (SVMs), multilayer perceptron, voted perceptron, and others.
  - weka.classifiers.lazy: These are instance-based algorithms such as k-nearest neighbors, K*, and lazy Bayesian rules.
  - weka.classifiers.meta: These are supervised learning meta-algorithms, including AdaBoost, bagging, additive regression, random committee, and so on.
  - weka.classifiers.mi: These are multiple-instance learning algorithms, such as citation k-nearest neighbors, diverse density, AdaBoost, and others.
  - weka.classifiers.rules: These are decision tables and decision rules based on the separate-and-conquer approach, such as RIPPER, PART, PRISM, and so on.
  - weka.classifiers.trees: These are various decision tree algorithms, including ID3, C4.5, M5, functional tree, logistic tree, random forest, and so on.
- weka.clusterers: These are clustering algorithms, including k-means, CLOPE, Cobweb, DBSCAN, hierarchical clustering, and FarthestFirst.
- weka.core: These are various utility classes such as the attribute class, statistics class, and instance class.
- weka.datagenerators: These are data generators for classification, regression, and clustering algorithms.
- weka.estimators: These are various data distribution estimators for discrete/nominal domains, conditional probability estimations, and so on.
- weka.experiment: This is a set of classes supporting the configuration, datasets, model setups, and statistics needed to run experiments.
- weka.filters: These are attribute-based and instance-based selection algorithms for both supervised and unsupervised data preprocessing.
- weka.gui: These are the graphical interfaces implementing the Explorer, Experimenter, and KnowledgeFlow applications. The Weka Explorer allows you to investigate datasets, algorithms, and their parameters, and to visualize datasets with scatter plots and other visualizations. The Weka Experimenter is used to design batches of experiments, but it can only be used for classification and regression problems. The Weka KnowledgeFlow implements a visual drag-and-drop user interface for building data flows that, for example, load data, apply a filter, build a classifier, and evaluate it.
