Appendix B

The WEKA workbench

The WEKA workbench is a collection of machine learning algorithms and data preprocessing tools that includes virtually all the algorithms described in this book. It is designed so that you can quickly try out existing methods on new datasets in flexible ways. It provides extensive support for the whole process of experimental data mining, including preparing the input data, evaluating learning schemes statistically, and visualizing the input data and the result of learning. As well as a wide variety of learning algorithms, it includes a wide range of preprocessing tools. This diverse and comprehensive toolkit is accessed through a common interface so that its users can compare different methods and identify those that are most appropriate for the problem at hand.

WEKA was developed at the University of Waikato in New Zealand; the name stands for Waikato Environment for Knowledge Analysis. Outside the university, the weka, pronounced to rhyme with Mecca, is a flightless bird with an inquisitive nature found only on the islands of New Zealand. The system is written in Java and distributed under the terms of the GNU General Public License. It runs on almost any platform and has been tested under Linux, Windows, and Macintosh operating systems.

B.1 What’s in WEKA?

WEKA provides implementations of learning algorithms that you can easily apply to your dataset. It also includes a variety of tools for transforming datasets, such as the algorithms for discretization and sampling. You can preprocess a dataset, feed it into a learning scheme, and analyze the resulting classifier and its performance—all without writing any program code at all.

The workbench includes methods for the main data mining problems: regression, classification, clustering, association rule mining, and attribute selection. Getting to know the data is an integral part of the work, and many data visualization facilities and data preprocessing tools are provided. All algorithms take their input in the form of a single relational table that can be read from a file or generated by a database query.

One way of using WEKA is to apply a learning method to a dataset and analyze its output to learn more about the data. Another is to use learned models to generate predictions on new instances. A third is to apply several different learners and compare their performance in order to choose one for prediction. In the interactive WEKA interface you select the learning method you want from a menu. Many methods have tunable parameters, which you access through a property sheet or object editor. A common evaluation module is used to measure the performance of all classifiers.

Implementations of actual learning schemes are the most valuable resource that WEKA provides. But tools for preprocessing the data, called filters, come a close second. Like classifiers, you select filters from a menu and tailor them to your requirements.

How do you use it?

The easiest way to use WEKA is through a graphical user interface called the Explorer. This gives access to all of its facilities using menu selection and form filling. For example, you can quickly read in a dataset from a file and build a decision tree from it. The Explorer guides you by presenting options as forms to be filled out. Helpful tool tips pop up as the mouse passes over items on the screen to explain what they do. Sensible default values ensure that you can get results with a minimum of effort—but you will have to think about what you are doing to understand what the results mean.

There are three other graphical user interfaces to WEKA. The Knowledge Flow interface allows you to design configurations for streamed data processing. A fundamental disadvantage of the Explorer is that it holds everything in main memory—when you open a dataset, it immediately loads it all in. That means that it can only be applied to small- to medium-sized problems. However, WEKA contains some incremental algorithms that can be used to process very large datasets. The Knowledge Flow interface lets you drag boxes representing learning algorithms and data sources around the screen and join them together into the configuration you want. It enables you to specify a data stream by connecting components representing data sources, preprocessing tools, learning algorithms, evaluation methods, and visualization modules. If the filters and learning algorithms are capable of incremental learning, data will be loaded and processed incrementally.

WEKA’s third interface, the Experimenter, is designed to help you answer a basic practical question when applying classification and regression techniques: Which methods and parameter values work best for the given problem? There is usually no way to answer this question a priori, and one reason we developed the workbench was to provide an environment that enables WEKA users to compare a variety of learning techniques. This can be done interactively using the Explorer. However, the Experimenter allows you to automate the process by making it easy to run classifiers and filters with different parameter settings on a corpus of datasets, collect performance statistics, and perform significance tests. Advanced users can employ the Experimenter to distribute the computing load across multiple machines using Java remote method invocation. In this way you can set up large-scale statistical experiments and leave them to run.

The fourth interface, called the Workbench, is a unified graphical user interface that combines the other three (and any plugins that the user has installed) into one application. The Workbench is highly configurable, allowing the user to specify which applications and plugins will appear, along with settings relating to them.

Behind these interactive interfaces lies the basic functionality of WEKA. This can be accessed in raw form by entering textual commands, which gives access to all features of the system. When you fire up WEKA you have to choose among five different user interfaces via the WEKA GUI Chooser: the Explorer, Knowledge Flow, Experimenter, Workbench, and command-line interfaces (we do not consider the command-line interface in this Appendix). Most people choose the Explorer, at least initially.

What else can you do?

An important resource when working with WEKA is the online documentation, which has been automatically generated from the source code and concisely reflects its structure. The online documentation gives the only complete list of available algorithms because WEKA is continually growing and—being generated automatically from the source code—the online documentation is always up to date. Moreover, it becomes essential if you want to proceed to the next level and access the library from your own Java programs or write and test learning schemes of your own.

In most data mining applications, the machine learning component is just a small part of a far larger software system. If you intend to write a data mining application, you will want to access the programs in WEKA from inside your own code. By doing so, you can solve the machine learning subproblem of your application with a minimum of additional programming.

If you intend to become an expert in machine learning algorithms (or, indeed, if you already are one), you will probably want to implement your own algorithms without having to address such mundane details as reading the data from a file, implementing filtering algorithms, or providing code to evaluate the results. If so, we have good news for you: WEKA already includes all this. To make full use of it, you must become acquainted with the basic data structures.

An extended version of this Appendix, which discusses these opportunities for advanced users and also describes the command-line interface, is available at

B.2 The package management system

The WEKA software has evolved considerably since the third edition of this book was published. Many new algorithms and features have been added to the system, a number of which have been contributed by the community. With so many algorithms on offer we felt that the software could be considered overwhelming to the new user. Therefore a number of algorithms and community contributions were removed and placed into plugin packages. A package management system was added that allows the user to browse for, and selectively install, packages of interest.

Another motivation for introducing the package management system was to make the process of contributing to the WEKA software easier, and to ease the maintenance burden on the WEKA development team. A contributor of a plugin package is responsible for maintaining its code and hosting the installable archive, while WEKA simply tracks the package metadata. The package system also opens the door to the use of third-party libraries, something that we would have discouraged in the past in order to keep a lightweight footprint for WEKA.

The graphical package manager can be accessed from the Tools menu of WEKA’s GUI Chooser. The very first time the package manager is accessed it will download information about the currently available packages. This requires an Internet connection; however, once the package metadata has been downloaded it is possible to use the package manager to browse package information while offline. Of course, an Internet connection is still required to actually install a package.

The package manager presents a list of packages near the top of its window and a panel at the bottom that displays information on the currently selected package in the list. The user can choose to display packages that are available but not yet installed, only packages that are installed, or all packages. The list presents the name of each package, the broad category that it belongs to, the version currently installed (if any), the most recent version of the package available that is compatible with the version of WEKA being used, and a field that, for installed packages, indicates whether the package has been loaded successfully by WEKA or not. Although not obvious at first glance, it is possible to install older versions of a particular package. The Repository version field in the list is actually a drop-down box. The list of packages can be sorted, in ascending or descending order, by clicking on either the package or category column header.

The information panel at the bottom of the window has clickable links for each version of a given package. “Latest” always refers to the latest version of the package, and is the same as the highest version number available. Clicking one of these links displays further information, such as the author of the package, its license, where the installable archive is located, and its dependencies. The information about each package is also browsable at the Web location where WEKA’s package metadata is hosted. All packages have at least one dependency listed: the minimum version of the core WEKA system that they can work with. Some packages list further dependencies on other packages. For example, the multiInstanceLearning package depends on the multiInstanceFilters package. When installing multiInstanceLearning, and assuming that multiInstanceFilters is not already installed, the system will inform the user that multiInstanceFilters is required and will be installed automatically.

The package manager displays what are known as official packages for WEKA. These are packages that have been submitted to the WEKA team for review and have had their metadata added to the official central metadata repository. For one reason or another, an author of a package might decide to make it available in an unofficial capacity. These packages do not appear in the official list on the Web, or in the list displayed by the graphical package manager. If the user knows the URL of an archive containing an unofficial package, it can be installed by using the button in the upper right-hand corner of the package manager window.

Whenever a new package, or new version of an existing one, becomes available the package manager informs the user by displaying a large yellow warning icon. Hovering over this icon displays a tool-tip popup that lists the new packages and prompts the user to click the Refresh repository cache button. Clicking this button downloads a fresh copy of all the package information to the user’s computer.

The Install and Uninstall buttons at the top of the package manager’s window do exactly as their names suggest. More than one package can be installed or uninstalled in one go by selecting multiple entries in the list. By default, WEKA attempts to load all installed packages, and if a package cannot be loaded for some reason a message will be displayed in the Loaded column of the list. The user can opt to prevent a particular package from being loaded by selecting it and then clicking the Toggle load button. This will mark the package as one that should not be loaded the next time that WEKA is started. This can be useful if an unstable package is generating errors, conflicting with another package (perhaps due to third-party libraries), or otherwise preventing WEKA from operating properly.

B.3 The Explorer

WEKA’s historically most popular graphical user interface, the Explorer, gives access to all its facilities using menu selection and form filling. To begin, there are six different panels, selected by the tabs at the top, corresponding to the various data mining tasks that WEKA supports. Further panels can become available by installing appropriate packages.

Loading the data into the Explorer

To illustrate what can be done with the Explorer, suppose we want to build a decision tree from the weather data included in the WEKA download. Fire up WEKA to get the GUI Chooser. Select Explorer from the five choices on the right-hand side. (The others were mentioned earlier: Simple CLI is the old-fashioned command-line interface.)

What you see next is the main Explorer screen. The six tabs along the top are the basic operations that the Explorer supports: right now we are on Preprocess. Click the Open file button to bring up a standard dialog through which you can select a file. Choose the weather.arff file. If you have it in CSV format, change the Format item in the dialog from ARFF data files to CSV data files.

Having loaded the file, the Preprocess screen tells you about the dataset: it has 14 instances and 5 attributes (center left); the attributes are called outlook, temperature, humidity, windy, and play (lower left). The first attribute, outlook, is selected by default (you can choose others by clicking them) and has no missing values, three distinct values, and no unique values; the actual values are sunny, overcast, and rainy and they occur five, four, and five times, respectively (center right). A histogram at the lower right shows how often each of the two values of the class, play, occurs for each value of the outlook attribute. The attribute outlook is used because it appears in the box above the histogram, but you can draw a histogram of any other attribute instead. Here play is selected as the class attribute; it is used to color the histogram, and any filters that require a class value use it too.
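The statistics reported for a nominal attribute can be reproduced by hand. The following Python sketch is not WEKA code: it assumes the 14 outlook values of the weather data typed in as a list, and the function name summarize_nominal is made up for illustration.

```python
from collections import Counter

# Outlook column of the 14-instance weather data (None would mark a missing value).
outlook = ["sunny", "sunny", "overcast", "rainy", "rainy", "rainy", "overcast",
           "sunny", "sunny", "rainy", "sunny", "overcast", "overcast", "rainy"]

def summarize_nominal(values):
    """Return missing, distinct, and unique counts plus per-value frequencies."""
    present = [v for v in values if v is not None]
    counts = Counter(present)
    return {
        "missing": len(values) - len(present),
        "distinct": len(counts),
        # "unique" counts values that occur exactly once in the column
        "unique": sum(1 for c in counts.values() if c == 1),
        "counts": dict(counts),
    }

summary = summarize_nominal(outlook)
print(summary)
```

The result agrees with the figures quoted above: no missing values, three distinct values, no unique values, and frequencies of five, four, and five.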

The outlook attribute is nominal. If you select a numeric attribute, you see its minimum and maximum values, mean, and standard deviation. In this case the histogram will show the distribution of the class as a function of this attribute.

You can delete an attribute by clicking its checkbox and using the Remove button. All selects all the attributes, None selects none, Invert inverts the current selection, and Pattern selects those attributes whose names match a user-supplied regular expression. You can undo a change by clicking the Undo button. The Edit button brings up an editor that allows you to inspect the data, search for particular values and edit them, and delete instances and attributes. Right-clicking on values and column headers brings up corresponding context menus.
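The selection buttons amount to simple set operations on the attribute names. The Python sketch below illustrates them; it is not WEKA code, all function names are hypothetical, and the assumption that a pattern must match the whole attribute name is ours.

```python
import re

attributes = ["outlook", "temperature", "humidity", "windy", "play"]

def select_all(names):
    return set(names)

def select_none(names):
    return set()

def invert(names, selected):
    # Everything that is currently unticked becomes ticked, and vice versa.
    return set(names) - set(selected)

def select_pattern(names, pattern):
    # Assumption: the regular expression must match the whole attribute name.
    return {name for name in names if re.fullmatch(pattern, name)}

def remove(names, selected):
    # Deleting the ticked attributes keeps the others in their original order.
    return [name for name in names if name not in selected]

ticked = select_pattern(attributes, r".*y")   # names ending in "y"
remaining = remove(attributes, ticked)
```

Here the pattern ticks humidity, windy, and play, so removing the ticked attributes leaves outlook and temperature.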

Building a decision tree

To build a decision tree, click the Classify tab to get access to WEKA’s classification and regression schemes. In the Classify panel, select the classifier by clicking the Choose button at the top left, opening up the trees section of the hierarchical menu that appears, and finding J48. The menu structure represents the organization of the WEKA code into modules and the items you need to select are always at the lowest level. Once selected, J48 appears in the line beside the Choose button, along with its default parameter values. If you click that line, the J48 classifier’s object editor opens up and you can see what the parameters mean and alter their values if you wish. The Explorer generally chooses sensible defaults.

Having chosen the classifier, invoke it by clicking the Start button. WEKA works for a brief period—when it is working, the little bird at the lower right of the Explorer jumps up and dances—and then produces the output for J48.

Examining the output

At the beginning of the output is a summary of the dataset, and the fact that 10-fold cross-validation was used to evaluate it. That is the default, and if you look closely at the Classify panel you will see that the Cross-validation box at the left is checked. Then comes a pruned decision tree in textual form. The model that is shown here is always the one generated from the full dataset available from the Preprocess panel.

The next part of the output gives estimates of the tree’s predictive performance. In this case they are obtained using stratified cross-validation with 10 folds. As well as the classification error, the evaluation module also outputs several other performance statistics.
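To make the idea of stratified cross-validation concrete, here is a minimal Python sketch of how instances might be dealt into folds so that each fold receives a similar share of every class. It is an illustration only, not WEKA's implementation; the function name stratified_folds and the seed are assumptions.

```python
import random
from collections import defaultdict

def stratified_folds(labels, k, seed=1):
    """Deal instance indices into k folds so each class is spread evenly."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    folds = [[] for _ in range(k)]
    position = 0
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Keep dealing where the previous class stopped so fold sizes stay balanced.
        for index in indices:
            folds[position % k].append(index)
            position += 1
    return folds

# Class column (play) of the 14-instance weather data.
play = ["no", "no", "yes", "yes", "yes", "no", "yes",
        "no", "yes", "yes", "yes", "yes", "yes", "no"]

folds = stratified_folds(play, k=10)
splits = []
for test_indices in folds:
    held_out = set(test_indices)
    train_indices = [i for i in range(len(play)) if i not in held_out]
    splits.append((train_indices, test_indices))  # train on the first, test on the second
```

Each instance is held out exactly once, and the error estimate is obtained by pooling the predictions made on the ten held-out folds.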

The Classify panel has several other test options: Use training set, which is generally not recommended; Supplied test set, in which you specify a separate file containing the test set; and Percentage split, with which you can hold out a certain percentage of the data for testing. You can output the predictions for each instance by clicking the More options button and checking the appropriate entry. There are other useful options, such as suppressing some output and including other statistics such as entropy evaluation measures and cost-sensitive evaluation.

Working with models

The small pane at the lower left of the Classify panel, which contains one highlighted line, is a history list of the results. The Explorer adds a new line whenever you run a classifier. To return to a previous result set, click the corresponding line and the output for that run will appear in the Classifier Output pane. This makes it easy to explore different classifiers or evaluation schemes and revisit the results to compare them.

When you right-click an entry a menu appears that allows you to view the results in a separate window, or save the result buffer. More importantly, you can save the model that WEKA has generated in the form of a Java object file. You can reload a model that was saved previously, which generates a new entry in the result list. If you now supply a test set, you can reevaluate the old model on that new set.

Several items on the right-click menu allow you to visualize the results in various ways. At the top of the Explorer interface is a separate Visualize tab, but that is different: it shows the dataset, not the results for a particular model. By right-clicking an entry in the history list you can see the classifier errors. If the model is a tree or a Bayesian network you can see its structure. You can also view the margin curve and various cost and threshold curves, and perform a cost/benefit analysis.

Exploring the Explorer

We have briefly investigated two of the six tabs at the top of the Explorer. In summary, here is what all the basic tabs do:

1. Preprocess: Choose the dataset and modify it in various ways.

2. Classify: Train learning schemes that perform classification or regression and evaluate them.

3. Cluster: Learn clusters for the dataset.

4. Associate: Learn association rules for the data and evaluate them.

5. Select attributes: Select the most relevant aspects in the dataset.

6. Visualize: View different two-dimensional plots of the data and interact with them.

Each tab gives access to a whole range of facilities. In our tour so far, we have barely scratched the surface of the Preprocess and Classify panels.

At the bottom of every panel is a Status box and a Log button. The status box displays messages that keep you informed about what is going on. For example, if the Explorer is busy loading a file, the status box will say so. Right-clicking anywhere inside this box brings up a little menu with two options: display the amount of memory available to WEKA, and run the Java garbage collector. Note that the garbage collector runs constantly as a background task anyway.

Clicking the Log button opens a textual log of the actions that WEKA has performed in this session, with timestamps.

As noted earlier, the little bird at the lower right of the window jumps up and dances when WEKA is active. The number beside the × shows how many concurrent processes are running. If the bird is standing but stops moving, it is sick! Something has gone wrong, and you may have to restart the Explorer.

Loading and filtering files

Along the top of the Preprocess panel are buttons for opening files, URLs, and databases. Initially, only files whose names end in .arff appear in the file browser; to see others, change the Format item in the file selection box.

Data can be saved in various formats using the Save button in the Preprocess panel. It is also possible to generate artificial data using the Generate button. Apart from loading and saving datasets, the Preprocess panel also allows you to filter them. Clicking Choose (near the top left) in the Preprocess panel gives a list of filters. We will describe how to use a simple filter to delete specified attributes from a dataset, in other words, to perform manual attribute selection. The same effect can be achieved more easily by selecting the relevant attributes using the tick boxes and pressing the Remove button. Nevertheless we describe the equivalent filtering operation explicitly, as an example.

Remove is an unsupervised attribute filter, and to see it you must first expand the unsupervised category and then the attribute category. This will reveal quite a formidable list of filters, and you will have to scroll further down to find Remove. When selected, it appears in the line beside the Choose button, along with its parameter values—in this case the line reads simply “Remove.” Click that line to bring up a generic object editor with which you can examine and alter the filter’s properties.

To learn about it, click the More button. This explains that the filter removes a range of attributes from the dataset. It has an option, attributeIndices, that specifies the range to act on and another called invertSelection that determines whether the filter selects attributes or deletes them. There are boxes for both of these in the object editor. After configuring an object it is often worth glancing at the resulting command-line formulation that the Explorer sets up, which is shown next to the Choose button.
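The interplay of the two options can be sketched in a few lines of Python. This is not WEKA's code: the helper names are hypothetical, and the range syntax handled here (1-based indices, commas, hyphenated ranges, and the words first and last) is an assumption about how such a range is written.

```python
def parse_indices(spec, num_attributes):
    """Expand a 1-based range string such as "1,3-4" into a set of 0-based indices."""
    def position(token):
        token = token.strip()
        if token == "first":
            return 0
        if token == "last":
            return num_attributes - 1
        return int(token) - 1

    chosen = set()
    for part in spec.split(","):
        if "-" in part:
            low, high = part.split("-")
            chosen.update(range(position(low), position(high) + 1))
        else:
            chosen.add(position(part))
    return chosen

def remove(header, rows, attribute_indices, invert_selection=False):
    chosen = parse_indices(attribute_indices, len(header))
    # Normally the chosen attributes are deleted; with inversion only they are kept.
    keep = [i for i in range(len(header)) if (i in chosen) == invert_selection]
    return [header[i] for i in keep], [[row[i] for i in keep] for row in rows]

header = ["outlook", "temperature", "humidity", "windy", "play"]
rows = [["sunny", 85, 85, "FALSE", "no"],        # two illustrative rows
        ["overcast", 83, 86, "FALSE", "yes"]]

deleted_header, deleted_rows = remove(header, rows, "2-3")
kept_header, kept_rows = remove(header, rows, "first,last", invert_selection=True)
```

The first call deletes the second and third attributes; the second call, with the selection inverted, keeps only the first and last.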

Algorithms in WEKA may provide information about what data characteristics they can handle, and, if they do, a Capabilities button appears underneath More in the generic object editor. Clicking it brings up information about what the method can do. In this case it states that Remove can handle many attribute characteristics, such as different types (nominal, numeric, relational, etc.) and missing values. It shows the minimum number of instances that are required for Remove to operate on.

A list of selected constraints on capabilities can be obtained by clicking the Filter button at the bottom of the generic object editor. If the current dataset exhibits some characteristic that is ticked in this list but missing from the capabilities for the Remove filter, the Apply button to the right of Choose in the Preprocess panel will be grayed out, as will the entry in the list that appears when the Choose button is pressed. Although you cannot apply it, you can nevertheless select a grayed-out entry to inspect its options, documentation, and capabilities using the generic object editor. You can release individual constraints by deselecting them in the constraints list, or click the Remove filter button to clear all the constraints.

Clustering and association rules

Use the Cluster and Associate panels to invoke clustering algorithms and methods for finding association rules. When clustering, WEKA shows the number of clusters and how many instances each cluster contains. For some algorithms the number of clusters can be specified by setting a parameter in the object editor. For probabilistic clustering methods, WEKA measures the log-likelihood of the clusters on the training data: the larger this quantity, the better the model fits the data. Increasing the number of clusters normally increases the likelihood, but may overfit.

The controls on the Cluster panel are similar to those for Classify. You can specify some of the same evaluation methods—use training set, supplied test set, and percentage split (the last two are used with the log-likelihood). A further method, classes to clusters evaluation, compares how well the chosen clusters match a preassigned class in the data. You select an attribute (which must be nominal) that represents the “true” class. Having clustered the data, WEKA determines the majority class in each cluster and prints a confusion matrix showing how many errors there would be if the clusters were used instead of the true class. If your dataset has a class attribute, you can ignore it during clustering by selecting it from a pull-down list of attributes, and see how well the clusters correspond to actual class values. Finally, you can choose whether or not to store the clusters for visualization. The only reason not to do so is to conserve space. As with classifiers, you visualize the results by right-clicking on the result list, which allows you to view two-dimensional scatter plots. If you have chosen classes to clusters evaluation, the class assignment errors are shown. For the Cobweb clustering scheme, you can also visualize the tree.
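Classes to clusters evaluation, as described above, can be sketched as follows in Python. The code follows the majority-class description in the text rather than WEKA's source, so the actual assignment of classes to clusters in WEKA may differ in detail; the function name and the toy cluster assignments are made up.

```python
from collections import Counter, defaultdict

def classes_to_clusters(cluster_ids, class_labels):
    """Tabulate classes per cluster, pick each cluster's majority class, count errors."""
    table = defaultdict(Counter)   # confusion matrix: cluster -> class -> count
    for cluster, label in zip(cluster_ids, class_labels):
        table[cluster][label] += 1
    majority = {cluster: counts.most_common(1)[0][0]
                for cluster, counts in table.items()}
    errors = sum(1 for cluster, label in zip(cluster_ids, class_labels)
                 if majority[cluster] != label)
    return table, majority, errors

# Hypothetical result of clustering seven instances with the class ignored.
cluster_ids = [0, 0, 0, 1, 1, 1, 1]
true_class = ["yes", "yes", "no", "no", "no", "no", "yes"]

table, majority, errors = classes_to_clusters(cluster_ids, true_class)
```

In this toy example cluster 0 is labeled yes and cluster 1 is labeled no, and two of the seven instances would be misclassified if the clusters were used in place of the true class.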

The Associate panel is simpler than Classify or Cluster. WEKA contains several algorithms for determining association rules, but no methods for evaluating such rules.

Attribute selection

The Select attributes panel gives access to several methods for attribute selection. These involve an attribute evaluator and a search method. Both are chosen in the usual way and configured with the object editor. You must also decide which attribute to use as the class. Attribute selection can be performed using the full training set or using cross-validation. In the latter case it is done separately for each fold, and the output shows how many times—i.e., in how many of the folds—each attribute was selected. The results are stored in the history list. When you right-click an entry here you can visualize the dataset in terms of the selected attributes (choose Visualize reduced data).


Visualization

The Visualize panel helps you visualize a dataset: not the result of a classification or clustering model, but the dataset itself. It displays a matrix of two-dimensional scatter plots of every pair of attributes. You can select an attribute (normally the class) for coloring the data points using the controls at the bottom. If it is nominal, the coloring is discrete; if it is numeric, the color spectrum ranges continuously from blue (low values) to orange (high values). Data points with no class value are shown in black. You can change the size of each plot, the size of the points, and the amount of jitter, which is a random displacement applied to X and Y values to separate points that lie on top of one another. Without jitter, a thousand instances at the same data point would look just the same as one instance. You can reduce the size of the matrix of plots by selecting certain attributes, and you can subsample the data for efficiency. Changes in the controls do not take effect until the Update button is clicked.
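The effect of jitter is easy to demonstrate. The Python sketch below displaces each point by a small random amount proportional to the axis range; the uniform distribution, the function name, and the numeric values are assumptions chosen for illustration and do not describe how WEKA computes jitter.

```python
import random

def jitter(points, amount, x_range, y_range, seed=0):
    """Displace each (x, y) point by a random offset of at most amount * range per axis."""
    rng = random.Random(seed)
    return [(x + rng.uniform(-amount, amount) * x_range,
             y + rng.uniform(-amount, amount) * y_range)
            for x, y in points]

# A thousand instances stacked on exactly the same spot.
stacked = [(70.0, 85.0)] * 1000
plain = jitter(stacked, amount=0.0, x_range=21.0, y_range=31.0)
spread = jitter(stacked, amount=0.02, x_range=21.0, y_range=31.0)
print(len(set(plain)), len(set(spread)))
```

With no jitter all thousand points collapse onto a single plotted location; with a little jitter they spread into a small cloud whose density hints at how many instances lie there.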

Clicking one of the plots in the matrix enlarges it. You can zoom in on any area of the resulting panel by choosing Rectangle from the menu near the top right and dragging out a rectangle on the viewing area. The Submit button near the top left rescales the rectangle into the viewing area.

Filtering algorithms

Now we take a closer look at the filtering algorithms implemented within WEKA. There are two kinds of filter: unsupervised and supervised. This seemingly innocuous distinction masks a rather fundamental issue. Filters are often applied to a training dataset and then also applied to the test file. If the filter is supervised—e.g., if it uses class values to derive good intervals for discretization—applying it to the test data will bias the results. It is the discretization intervals derived from the training data that must be applied to the test data. When using supervised filters you must be careful to ensure that the results are evaluated fairly, an issue that does not generally arise with unsupervised filters.

Because of popular demand, WEKA allows you to invoke supervised filters as a preprocessing operation, just like unsupervised filters. However, if you intend to use them for classification you should adopt a different methodology. A metalearner is provided in the Classify panel that invokes a filter in a way that wraps the learning algorithm into the filtering mechanism. This filters the test data using the filter that has been created from the training data. It is also useful for some unsupervised filters. For example, in WEKA’s StringToWordVector filter the dictionary will be created from the training data alone: words that are novel in the test data will be discarded. To use a supervised filter in this way, invoke the FilteredClassifier metalearning scheme from the meta section of the menu displayed by the Classify panel’s Choose button.
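The principle of building a filter from the training data alone and then applying it unchanged to the test data can be illustrated with a toy word-vector filter. This Python sketch only mimics the idea described above; the function names, the whitespace tokenization, and the use of 0/1 presence indicators are assumptions, not a description of StringToWordVector itself.

```python
def build_dictionary(train_docs):
    """Create the word-to-column mapping from the training documents only."""
    vocabulary = sorted({word for doc in train_docs for word in doc.lower().split()})
    return {word: column for column, word in enumerate(vocabulary)}

def to_word_vector(doc, dictionary):
    """Encode a document as 0/1 word-presence indicators over the fixed dictionary."""
    vector = [0] * len(dictionary)
    for word in doc.lower().split():
        if word in dictionary:        # words unseen during training are discarded
            vector[dictionary[word]] = 1
    return vector

train_docs = ["the bird dances", "the bird is sick"]
dictionary = build_dictionary(train_docs)          # fitted on training data only

test_vector = to_word_vector("the weka dances", dictionary)
```

The word weka never occurred in the training documents, so it has no column and is silently dropped from the test vector; the test data has no influence on the dictionary.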

Within each type there is a further distinction between attribute filters, which work on the attributes in the datasets, and instance filters, which work on the instances. To learn more about a particular filter, select it in the WEKA Explorer and look at its associated object editor, which defines what the filter does and the parameters it takes.

Learning algorithms

On the Classify panel, when you select a learning algorithm using the Choose button the command-line version of the classifier appears in the line beside the button, including the parameters specified with minus signs. To change them, click that line to get an appropriate object editor. The classifiers in WEKA are divided into Bayesian classifiers, trees, rules, functions, lazy classifiers, meta classifiers, and a final miscellaneous category.

Metalearning algorithms take classifiers and turn them into more powerful learners, or retarget them for other applications. They are used to perform boosting, bagging, cost-sensitive classification and learning, automatic parameter optimization, and many other tasks. We already mentioned FilteredClassifier: it runs a classifier on data that has been passed through a filter, which is a parameter. The filter’s own parameters are based exclusively on the training data, which is the appropriate way to apply a supervised filter to test data.

Attribute selection

Attribute selection can be performed in the Explorer’s Select attributes tab. It is normally done by searching the space of attribute subsets, evaluating each one. A potentially faster but less accurate approach is to evaluate the attributes individually and sort them, discarding attributes that fall below a chosen cut-off point. WEKA supports both methods.

Subset evaluators take a subset of attributes and return a numerical measure that guides the search. They are configured like any other WEKA object. Single-attribute evaluators are used with the Ranker search method to generate a ranked list from which Ranker discards a given number.
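As a rough illustration of ranking with a single-attribute evaluator, the following Python sketch scores each nominal attribute by information gain and keeps the top few. It is not WEKA's implementation: the function names, the toy data, and the choice of information gain as the score are assumptions made for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def info_gain(values, labels):
    """Single-attribute score: class entropy minus entropy after splitting on the attribute."""
    total = len(labels)
    remainder = 0.0
    for v in set(values):
        subset = [label for x, label in zip(values, labels) if x == v]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def rank_attributes(columns, labels, num_to_keep):
    """Score each attribute on its own, sort by score, and keep the best num_to_keep."""
    scored = sorted(((info_gain(col, labels), name) for name, col in columns.items()),
                    reverse=True)
    return [name for _, name in scored[:num_to_keep]]

# Hypothetical toy dataset with four instances.
labels = ["yes", "yes", "no", "no"]
columns = {
    "outlook": ["sunny", "sunny", "rainy", "rainy"],   # separates the classes perfectly
    "humidity": ["high", "high", "high", "normal"],    # partially informative
    "windy": ["t", "f", "t", "f"],                     # uninformative
}
kept = rank_attributes(columns, labels, num_to_keep=2)
print(kept)
```

Because each attribute is scored in isolation, this approach is fast but cannot detect attributes that are only useful in combination, which is why it is described as potentially less accurate than subset search.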

Search methods traverse the attribute space to find a good subset. Quality is measured by the chosen attribute subset evaluator. Each search method can be configured with WEKA’s object editor, just like evaluator objects.
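One simple way to traverse the attribute space is greedy forward selection, sketched below in Python. This is an illustration of the general idea rather than any particular WEKA search method, and toy_merit is a hypothetical stand-in for a real subset evaluator.

```python
def forward_selection(attributes, evaluate_subset):
    """Greedy forward search: repeatedly add the attribute that most improves
    the merit reported by the subset evaluator; stop when nothing improves it."""
    selected = []
    best_merit = evaluate_subset(selected)
    while True:
        best_candidate = None
        for attr in attributes:
            if attr in selected:
                continue
            merit = evaluate_subset(selected + [attr])
            if merit > best_merit:
                best_merit, best_candidate = merit, attr
        if best_candidate is None:
            return selected, best_merit
        selected.append(best_candidate)

# Hypothetical evaluator: each attribute contributes a fixed usefulness,
# and every attribute in the subset costs one unit.
usefulness = {"a": 5, "b": 3, "c": 0}

def toy_merit(subset):
    return sum(usefulness[attr] for attr in subset) - len(subset)

result = forward_selection(["a", "b", "c"], toy_merit)
print(result)
```

The search only needs a function that maps a subset to a number, which mirrors the separation between search method and subset evaluator described above.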

B.4 The Knowledge Flow Interface

With the Knowledge Flow interface, users select WEKA components from a tool bar, place them on a layout canvas, and connect them into a directed graph that processes and analyzes data. It provides an alternative to the Explorer for those who like thinking in terms of how data flows through the system. It also allows the design and execution of configurations for streamed data processing, which the Explorer cannot do. You invoke the Knowledge Flow interface by selecting KnowledgeFlow from the choices on the GUIChooser.

Getting started

Let us examine a step-by-step example that loads a data file and performs a cross-validation using the J48 decision tree learner. First create a source of data by expanding the DataSources folder in the Design palette on the left-hand side of the Knowledge Flow and selecting ARFFLoader. The mouse cursor changes to crosshairs to signal that you should next place the component. Do this by clicking anywhere on the canvas, whereupon a copy of the ARFF loader icon appears there. To connect it to an ARFF file, right-click it to bring up a pop-up menu and then click Configure to get an editor dialog. From here you can either browse for an ARFF file by clicking the Browse button, or type the path to one in the Filename field.

Now we specify which attribute is the class using a ClassAssigner object. This is found under the Evaluation folder in the Design palette, so expand the Evaluation folder, select the ClassAssigner, and place it on the canvas. To connect the data source to the class assigner, right-click the data source icon and select dataset from the menu. A rubber-band line appears. Move the mouse over the class assigner component and left-click. A red line labeled dataset appears, joining the two components. Having connected the class assigner, choose the class by right-clicking it, selecting Configure, and entering the location of the class attribute.

We will perform cross-validation on the J48 classifier. In the data flow model, we first connect the CrossValidationFoldMaker to create the folds on which the classifier will run, and then pass its output to an object representing J48. CrossValidationFoldMaker is in the Evaluation folder. Select it, place it on the canvas, and connect it to the class assigner by right-clicking the latter and selecting dataset from the menu. Next select J48 from the trees folder under the Classifiers folder and place a J48 component on the canvas. Connect J48 to the cross-validation fold maker in the usual way, but make the connection twice by first choosing trainingSet and then testSet from the pop-up menu for the cross-validation fold maker. The next step is to select a ClassifierPerformanceEvaluator from the Evaluation folder and connect J48 to it by selecting the batchClassifier entry from the pop-up menu for J48. Finally, from the Visualization folder we place a TextViewer component on the canvas. Connect the classifier performance evaluator to it by selecting the text entry from the pop-up menu for the performance evaluator.

The flow of execution is started by clicking one of the two triangular “play” buttons at the left side of the main toolbar. The leftmost play button launches all data sources present in the flow in parallel; the other play button launches the data sources sequentially, where a particular order of execution can be specified by including a number at the start of each component’s name (a name can be set via the Set name entry on the pop-up menu). For a small dataset things happen quickly. Progress information appears in the status area at the bottom of the interface. The entries in the status area show the progress of each step in the flow, along with its parameter settings (for learning schemes) and elapsed time. Any errors that occur in a processing step are shown in the status area by highlighting the corresponding row in red. Choosing Show results from the text viewer’s pop-up menu brings the results of cross-validation up in a separate window, in the same form as for the Explorer.

To complete the example, we can add a GraphViewer and connect it to J48’s graph output to see a graphical representation of the trees produced for each fold of the cross-validation. Once you have redone the cross-validation with this extra component in place, selecting Show results from its pop-up menu produces a list of trees, one for each cross-validation fold. By creating cross-validation folds and passing them to the classifier, the Knowledge Flow model provides a way to hook into the results for each fold.
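What a fold maker hands to the classifier can be sketched as a sequence of (training set, test set) pairs, one per fold. The Python below is a simplified illustration, not WEKA's CrossValidationFoldMaker: it assigns instances to folds by position and, as an assumption made for brevity, skips the randomization and stratification that a real cross-validation would normally apply. The function name make_folds is hypothetical.

```python
def make_folds(instances, num_folds):
    """Yield one (training_set, test_set) pair per fold.

    Instance i goes into the test set of fold i % num_folds and into the
    training set of every other fold.
    """
    for fold in range(num_folds):
        test_set = [x for i, x in enumerate(instances) if i % num_folds == fold]
        training_set = [x for i, x in enumerate(instances) if i % num_folds != fold]
        yield training_set, test_set

data = list(range(10))            # hypothetical dataset of ten instances
folds = list(make_folds(data, 5))
for number, (training_set, test_set) in enumerate(folds, start=1):
    print(number, training_set, test_set)
```

Because each fold arrives as a separate pair, anything attached downstream of the classifier (such as a viewer) sees one model per fold.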

The flow that we have just considered is actually available (minus the GraphViewer) as a built-in template. Example templates can be accessed from the Template button, which is the third icon from the right in the toolbar at the top of the Knowledge Flow interface. There are a number of templates that come with WEKA, and certain packages, once installed via the package manager, add further ones to the menu. The majority of template flows can be executed without further modification as they have been configured to load datasets that come with the WEKA distribution.

Knowledge Flow components

Most of the Knowledge Flow components will be familiar from the Explorer. The Classifiers folder contains all of WEKA’s classifiers, the Filters folder contains the filters, the Clusterers folder holds the clusterers, the AttSelection folder contains evaluators and search methods for attribute selection, and the Associations folder holds the association rule learners. Each component in the Knowledge Flow runs in a separate thread of execution, except when data is being processed incrementally. In that case a single thread of execution is used because, generally, the amount of processing done per data point is small, and launching a separate thread to process each one would incur a significant overhead.

Configuring and connecting the components

You establish the knowledge flow by configuring the individual components and connecting them up. The menus that are available by right-clicking various component types have up to three sections: Edit, Connections, and Actions. The Edit operations delete components and open up their configuration panel. You can give a component a name by choosing Set name from the pop-up menu. Classifiers and filters are configured just as in the Explorer. Data sources are configured by opening a file (as we saw previously) or by setting a database connection, and evaluation components are configured by setting parameters such as the number of folds for cross-validation. The Connections operations are used to connect components together by selecting the type of connection from the source component and then clicking on the target object. Not all targets are suitable; applicable ones are highlighted. Items on the connections menu are disabled (grayed out) until the component receives other connections that render them applicable.

There are two kinds of connection from data sources: dataset connections and instance connections. The former are for batch operations, such as those performed by classifiers like J48; the latter are for stream operations, such as those performed by NaiveBayesUpdateable (an incremental version of the Naïve Bayes classifier). A data source component cannot provide both types of connection: once one is selected, the other is disabled. When a dataset connection is made to a batch classifier, the classifier needs to know whether the dataset is intended to serve as a training set or a test set. To do this, you first make the data source into a test or training set using the TestSetMaker or TrainingSetMaker components from the Evaluation folder. On the other hand, an instance connection to an incremental classifier is made directly: there is no distinction between training and testing because the instances that flow update the classifier incrementally. In this case a prediction is made for each incoming instance and incorporated into the test results; then the classifier is trained on that instance. If you make an instance connection to a batch classifier, each instance will be used as a test instance, because training cannot possibly be incremental whereas testing always can be. Conversely, it is quite possible to test an incremental classifier in batch mode using a dataset connection.

Connections from a filter component are enabled when it receives input from a data source, whereupon follow-on dataset or instance connections can be made. Instance connections cannot be made to supervised filters or to unsupervised filters that cannot handle data incrementally (such as Discretize). To get a test or training set out of a filter, you need to put the appropriate kind in.

The classifier menu has two types of connection. The first type, namely, graph and text connections, provides graphical and textual representations of the classifier’s learned state and is only activated when it receives a training set input. The other type, namely, batchClassifier and incrementalClassifier connections, makes data available to a performance evaluator and is only activated when a test set input is present too. Which one is activated depends on the type of the classifier.

Evaluation components are a mixed bag. TrainingSetMaker and TestSetMaker turn a dataset into a training or test set. CrossValidationFoldMaker turns a dataset into both a training set and a test set. ClassifierPerformanceEvaluator generates textual and graphical output for visualization components. Other evaluation components operate like filters: they enable follow-on dataset, instance, training set, or test set connections depending on the input (e.g., ClassAssigner assigns a class to a dataset). Visualization components do not have connections, although some have actions such as Show results and Clear results.

Incremental learning

In most respects the Knowledge Flow interface is functionally similar to the Explorer: you can do similar things with both. It does provide some additional flexibility—e.g., you can see the tree that J48 makes for each cross-validation fold. But its real strength is the potential for incremental operation.

If all components connected up in the Knowledge Flow interface operate incrementally, so does the resulting learning system. It does not read in the dataset before learning starts, as the Explorer does. Instead, the data source component reads the input instance by instance and passes it through the Knowledge Flow chain.

Selecting the “Learn and evaluate Naive Bayes incrementally” template from the templates menu brings up a configuration that works incrementally. An instance connection is made from the loader to a class assigner component, which, in turn, is connected to the updatable Naïve Bayes classifier. The classifier’s text output is taken to a viewer that gives a textual description of the model. Also, an incrementalClassifier connection is made to the corresponding performance evaluator. This produces an output of type chart, which is piped to a strip chart visualization component to generate a scrolling data plot.

This particular Knowledge Flow configuration can process input files of any size, even ones that do not fit into the computer’s main memory. However, it all depends on how the classifier operates internally. For example, although they are incremental, many instance-based learners store the entire dataset internally.
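The difference between a learner whose state stays bounded and one that quietly keeps every instance can be seen in the following Python sketch. Both classes are hypothetical toys, and the in-memory CSV text stands in for a large file read one row at a time.

```python
import csv
import io
from collections import Counter

def instance_stream(handle):
    """Read one instance at a time; the last field of each row is the class."""
    for row in csv.reader(handle):
        yield row[:-1], row[-1]

class CountingModel:
    """State is one counter per class, however many instances are seen."""
    def __init__(self):
        self.class_counts = Counter()
    def update(self, features, label):
        self.class_counts[label] += 1

class StoringModel:
    """Incremental interface, but the state grows with every instance."""
    def __init__(self):
        self.instances = []
    def update(self, features, label):
        self.instances.append((features, label))

# Hypothetical data; a real source would be a file too large for memory.
source = io.StringIO("5.1,3.5,setosa\n7.0,3.2,versicolor\n4.9,3.0,setosa\n6.4,3.2,versicolor\n")
counting, storing = CountingModel(), StoringModel()
for features, label in instance_stream(source):
    counting.update(features, label)
    storing.update(features, label)

print(len(counting.class_counts), len(storing.instances))
```

Both models accept instances one at a time, yet only the first could process an arbitrarily long stream in fixed memory.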

B.5 The Experimenter

The Explorer and Knowledge Flow environments help you determine how well machine learning schemes perform on given datasets. But serious investigative work involves substantial experiments (typically running several learning schemes on different datasets, often with various parameter settings), and these interfaces are not really suitable for this. The Experimenter enables you to set up large-scale experiments, start them running, leave them, and come back when they have finished to analyze the performance statistics that have been collected. It automates the experimental process. The statistics can be stored in a file or database, and can themselves be the subject of further data mining. You invoke this interface by selecting Experimenter from the choices at the side of the GUIChooser.

Whereas the Knowledge Flow transcends limitations of space by allowing machine learning runs that do not load in the whole dataset at once, the Experimenter transcends limitations of time. It contains facilities for advanced users to distribute the computing load across multiple machines using Java RMI. You can set up big experiments and just leave them to run.

Getting started

As an example, we will compare the J48 decision tree method with the baseline methods OneR and ZeroR on the Iris dataset. The Experimenter has three panels: Setup, Run, and Analyze. To configure an experiment, first click New (toward the right at the top) to start a new experiment (the other two buttons in that row save an experiment and open a previously saved one). Then, on the line below, select the destination for the results—in this case the file Experiment1—and choose CSV file. Underneath, select the datasets—we have only one, the Iris data. To the right of the datasets, select the algorithms to be tested—we have three. Click Add new to get a standard WEKA object editor from which you can choose and configure a classifier. Repeat this operation to add the three classifiers. Now the experiment is ready.

The other settings are all default values. If you want to reconfigure a classifier that is already in the list, you can use the Edit selected button. You can also save the options for a particular classifier in XML format for later reuse. You can right-click on an entry to copy the configuration to the clipboard, and add or enter a configuration from the clipboard.

Running an experiment

To run the experiment, click the Run tab, which brings up a panel that contains a Start button (and little else); click it. A brief report is displayed when the operation is finished. The file Experiment1.csv contains the results, in CSV format, which can be directly read into a spreadsheet. Each row represents one fold of a 10-fold cross-validation (see the Fold column). The cross-validation is run 10 times (the Run column) for each classifier (the Scheme column). Thus the file contains 100 rows for each classifier, which makes 300 rows in all (plus the header row). Each row contains plenty of information, including the options supplied to the machine learning scheme; the number of training and test instances; the number (and percentage) of correct, incorrect, and unclassified instances; the mean absolute error and root mean-squared error; and much more.
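The row count follows directly from the experimental design, as this short Python sketch shows by enumerating every combination of scheme, run, and fold. The tuple layout is an assumption made for illustration and does not reproduce the actual columns of the results file.

```python
from collections import Counter
from itertools import product

schemes = ["J48", "OneR", "ZeroR"]
runs = range(1, 11)      # the cross-validation is repeated 10 times
folds = range(1, 11)     # each repetition has 10 folds

# One result row per (scheme, run, fold) combination.
rows = list(product(schemes, runs, folds))
rows_per_scheme = Counter(scheme for scheme, _, _ in rows)

print(len(rows))
print(dict(rows_per_scheme))
```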

There is a great deal of information in the spreadsheet, but it is hard to digest. In particular, it is not easy to answer the question posed previously: How does J48 compare with the baseline methods OneR and ZeroR on this dataset? For that we need the Analyze panel.

Analyzing the results

The reason that we generated the output in CSV format was to allow you to explore the raw data produced by the Experimenter in a spreadsheet. The Experimenter normally produces its output in ARFF format. You can also leave the file name blank, in which case the Experimenter stores the results in a temporary file.

To analyze the experiment that has just been performed, select the Analyze panel and click the Experiment button at the right near the top; otherwise, supply a file that contains the results of another experiment. Then click Perform test (near the bottom on the left). The result of a statistical significance test of the performance of the first learning scheme (J48) versus the other two (OneR and ZeroR) will be displayed in the large panel on the right.

We are comparing the percent correct statistic: this is selected by default as the comparison field shown toward the left of the output. The three methods are displayed horizontally, numbered (1), (2), and (3), as the heading of a little table. The labels for the columns are repeated at the bottom—trees.J48, rules.OneR, and rules.ZeroR—in case there is insufficient space for them in the heading. The inscrutable integers beside the scheme names identify which version of the scheme is being used. They are present by default to avoid confusion among results generated using different versions of the algorithms. The value in brackets at the beginning of the iris row (100) is the number of experimental runs: 10 times 10-fold cross-validation.

The percentage correct is shown for the three schemes: 94.73% for method 1, 92.53% for method 2, and 33.33% for method 3. The symbol placed beside a result indicates that it is statistically better (v) or worse (*) than the baseline scheme—in this case J48—at the specified significance level (0.05, or 5%). The corrected resampled t-test is used here. Here, method 3 is significantly worse than method 1, because its success rate is followed by an asterisk. At the bottom of columns 2 and 3 are counts (x/y/z) of the number of times the scheme was better than (x), the same as (y), or worse than (z) the baseline scheme on the datasets used in the experiment. In this case there is only one dataset; method 2 was equivalent to method 1 (the baseline) once and method 3 was worse than it once. (The annotation (v/ /*) is placed at the bottom of column 1 to help you remember the meanings of the three counts x/y/z.)
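For readers curious about the statistic itself, here is a minimal Python sketch of the corrected resampled t-test as it is commonly stated: the usual paired t statistic with the variance term 1/k replaced by 1/k + n_test/n_train, where k is the number of paired results. The function name and the example differences are hypothetical, and the details of WEKA's own implementation may differ.

```python
import math
from statistics import mean, variance

def corrected_resampled_t(differences, n_train, n_test):
    """t statistic for paired per-fold differences, with the variance inflated
    to account for overlapping training sets across resampled runs."""
    k = len(differences)
    var = variance(differences)          # sample variance (k - 1 denominator)
    if var == 0:
        raise ValueError("all differences are identical; the statistic is undefined")
    return mean(differences) / math.sqrt((1.0 / k + n_test / n_train) * var)

def standard_paired_t(differences):
    """Uncorrected paired t statistic, shown for comparison."""
    k = len(differences)
    return mean(differences) / math.sqrt(variance(differences) / k)

# Hypothetical differences in percent correct between two schemes.
diffs = [1.0, 2.0, 3.0, 4.0, 5.0]
# With 150 instances and 10 folds, each fold trains on 135 and tests on 15.
t_corrected = corrected_resampled_t(diffs, n_train=135, n_test=15)
t_standard = standard_paired_t(diffs)
print(round(t_corrected, 4), round(t_standard, 4))
```

The corrected statistic is always smaller in magnitude than the uncorrected one, so it is less likely to declare a difference significant. Either statistic is compared against a t distribution with k - 1 degrees of freedom.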

The output in the Analyze panel can be saved into a file by clicking the “Save output” button. It is also possible to open a WEKA Explorer window to further analyze the experimental results obtained, by clicking on the “Open Explorer” button.

Advanced setup

The Experimenter has an advanced mode, which is accessed by selecting Advanced from the drop down box near the top of the Setup panel. This enlarges the options available for controlling the experiment—including, e.g., the ability to generate learning curves. However, the advanced mode is hard to use, and the simple version suffices for most purposes. For example, in advanced mode you can set up an iteration to test an algorithm with a succession of different parameter values, but the same effect can be achieved in simple mode by putting the algorithm into the list several times with different parameter values.

One thing you can do in advanced mode but not in simple mode is run experiments using clustering algorithms. Here, experiments are limited to those clusterers that can compute probability or density estimates, and the main evaluation measure for comparison purposes is the log-likelihood. Another use for the advanced mode is to set up distributed experiments.

The Analyze panel

Our walkthrough used the Analyze panel to perform a statistical significance test of one learning scheme (J48) versus two others (OneR and ZeroR). The test was on the percent correct statistic. Other statistics can be selected from the drop-down menu instead, including various entropy figures. Moreover, you can see the standard deviation of the statistic being evaluated by ticking the Show std deviations checkbox.

Use the Test base menu to change the baseline scheme from J48 to one of the other learning schemes. For example, selecting OneR causes the others to be compared with this scheme. Apart from the learning schemes, there are two other choices in the Test base menu: Summary and Ranking. The former compares each learning scheme with every other scheme and prints a matrix whose cells contain the number of datasets on which one is significantly better than the other. The latter ranks the schemes according to the total number of datasets that represent wins (>) and losses (<) and prints a league table. The first column in the output gives the difference between the number of wins and the number of losses.
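The league table produced by Ranking can be approximated with the Python sketch below, which totals wins and losses for each scheme from pairwise counts and sorts by their difference. The function name and the counts are hypothetical; they are not output from an actual experiment.

```python
def league_table(better_counts):
    """better_counts maps (scheme_a, scheme_b) to the number of datasets on which
    scheme_a is significantly better than scheme_b.

    Returns rows of (wins - losses, wins, losses, scheme), best first.
    """
    schemes = sorted({scheme for pair in better_counts for scheme in pair})
    table = []
    for scheme in schemes:
        wins = sum(n for (a, _), n in better_counts.items() if a == scheme)
        losses = sum(n for (_, b), n in better_counts.items() if b == scheme)
        table.append((wins - losses, wins, losses, scheme))
    table.sort(key=lambda row: row[0], reverse=True)
    return table

# Hypothetical pairwise counts for a single dataset.
better_counts = {
    ("J48", "ZeroR"): 1,
    ("OneR", "ZeroR"): 1,
    ("J48", "OneR"): 0,
}
table = league_table(better_counts)
for row in table:
    print(row)
```

The same pairwise counts, laid out as a matrix with one row and one column per scheme, correspond to the Summary view.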

The Row and Column fields determine the dimensions of the comparison matrix. Clicking Select brings up a list of all the features that have been measured in the experiment. You can select which to use as the rows and columns of the matrix. (The selection does not appear in the Select box because more than one parameter can be chosen simultaneously.)

There is a button that allows you to select a subset of columns to display (the baseline column is always included), and another that allows you to select the output format: plain text (default), output for the LaTeX typesetting system, CSV format, HTML, data and script suitable for input to the GNUPlot graph plotting software, and just the significance symbols in plain text format. It is also possible to show averages and abbreviate filter class names in the output.

There is an option to choose whether to use the paired corrected t-test or the standard t-test for computing significance. The way the rows are sorted in the results table can be changed by choosing the Sorting (asc.) by option from the drop-down box. The default is to use natural ordering, presenting them in the order that the user entered the dataset names in the Setup panel. Alternatively, the rows can be sorted according to any of the measures that are available in the Comparison field.

