Chapter 13

Moving on: applications and beyond

Abstract

Recent years have seen machine learning techniques become prominent in unexpectedly diverse application areas. This chapter introduces a few of them: text mining, including document classification and clustering; web mining, including wrapper induction and the page-rank method used for web search; computer vision, including both object and face recognition; speech recognition; and natural language processing and understanding. Deep learning has made inroads in all these areas and we draw connections to the material covered in Chapter 10, Deep learning. We also consider some other issues that are relevant in practical applications. Applying machine learning to data mining often involves careful choice of learning algorithm and algorithm parameters. Many practical datasets are truly massive and cannot be tackled with standard algorithms designed for small-to-medium size data. In some real-world scenarios, data arrives in a stream, requiring the ability to constantly and quickly update the model and respond to changes in the nature of the data. Often, domain expertise is present in the form of background knowledge that can be used to aid the learning algorithm in finding good concept descriptions. Some applications, particularly in the cyber-security area, involve adversarial situations, where the learning algorithm is confronted with training data that is designed to be misleading. Machine learning techniques are beginning to creep into our daily environment, and we end by glimpsing a future of ubiquitous data mining.

Keywords

Massive datasets; data streams; domain knowledge; text mining; web mining; computer vision; speech recognition; natural language processing; adversarial learning; ubiquitous data mining

Machine learning is a burgeoning new technology for mining knowledge from data, a technology that a lot of people are beginning to take seriously. Looking forward, the main challenge ahead is applications. Opportunities abound. Wherever there is data, things can be learned from it. Whenever there is too much data for people to pore over themselves, the mechanics of learning will have to be automatic. But the inspiration will certainly not be automatic! Applications will come not from computer programs, nor from machine learning experts, nor from the data itself, but from the people who work with the data and the problems from which it arises. That is why we have written this book, and that is what the WEKA system described in Appendix B is for—to empower those who are not machine learning experts to apply these techniques to problems that arise in daily working life. The ideas are simple. The algorithms are here. The rest is really up to you!

Of course, development of the technology is certainly not finished. Machine learning is a hot research topic and new ideas and techniques continually emerge. To give a flavor of the scope and variety of research fronts, we close Part II by looking at some topical areas in the world of data mining.

13.1 Applying Machine Learning

In 2006 the IEEE International Conference on Data Mining took a poll to identify the top 10 data mining algorithms. Although now somewhat dated, the result, shown in Table 13.1, is still instructive to consider. It is good to see that all the algorithms are covered in this book! The conference organizers divided the algorithms into rough categories, which are also shown. Some of the assignments are rather arbitrary—Naïve Bayes, e.g., is certainly a statistical learning method. Nevertheless the emphasis on classification over other forms of learning, which reflects the emphasis in this book, is evident in the table, as is the prominence of C4.5, which we have also noted. One algorithm in Table 13.1 that has not been mentioned so far is the PageRank algorithm for link mining, which we were a little surprised to see in this list. Section 13.6 contains a brief description.

Table 13.1

The Top 10 Algorithms in Data Mining, According to a 2006 Poll

   Algorithm     Category              Book Section
 1 C4.5          Classification        4.3, 6.1
 2 K-means       Clustering            4.8
 3 SVM           Statistical learning  7.2
 4 Apriori       Association analysis  4.5, 6.3
 5 EM            Statistical learning  9.3, 9.4
 6 PageRank      Link mining           13.6
 7 AdaBoost      Ensemble learning     12.4
 8 kNN           Classification        4.7, 7.1
 9 Naïve Bayes   Classification        4.2
10 CART          Classification        6.1


We have repeatedly stressed that productive use of machine learning is not just a matter of finding some data and then blindly applying learning algorithms to it. Of course, the existence of tools such as the WEKA workbench makes that easy to do—and therein lies a danger. We have seen many publications that seem to follow this methodology: the authors run a plethora of learning algorithms on a particular dataset and then write an article claiming that such-and-such a machine learning method is best for such-and-such a problem—with little apparent understanding of what those algorithms do or of the nature of the data, and with little regard for statistical significance. The usefulness of such studies is questionable.

A related but rather different issue concerns the improvements in machine learning methods that have been reported over the years. In a 2006 paper provocatively entitled “Classifier technology and the illusion of progress,” David Hand, a prominent statistician and machine learning researcher, points out that a great many algorithms have been devised for supervised classification, and a great many comparative studies have been conducted that apparently establish the superiority of new methods over their predecessors. Yet he contends that the continued steady progress that publication of these studies seems to document is, in fact, to a large extent illusory. This message brings to mind the 1R machine learning scheme, published some 15 years earlier, with which we began Chapter 4, Algorithms: the basic methods. As pointed out there, 1R was never really intended as a machine learning “method”; it was devised to demonstrate that putting high-powered inductive inference methods to work on simple datasets is like using a sledgehammer to crack a nut. That insight underlies the simplicity-first methodology that pervades this book, of which Hand’s paper is a salutary reminder.

How can progress be largely illusory, given documented improvements in measured classification success? The claim is basically that the differences in performance are very small, and in practical applications are likely to be swamped by other sources of uncertainty. There are many reasons for this. Simple methods may not perform as well as complex ones, but they often perform nearly as well. An extremely simple model—always choose the majority class—sets a baseline upon which any learning method should be able to improve. Consider the improvement over the baseline achieved by a simple method as a proportion of the improvement over the baseline achieved by a sophisticated method. For a variety of randomly chosen datasets, it turns out that a very simple method achieved more than 90% of the improvement yielded by the most sophisticated scheme. This is not so surprising. In standard classification schemes such as decision trees and rules, a huge proportional gain in predictive accuracy is achieved at the beginning of the process when the first branch or rule is determined, and subsequent gains are small—usually very small indeed.
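To make the measure concrete, here is a tiny sketch with hypothetical accuracy figures (the numbers are illustrative and not taken from Hand’s study):

```python
# Hypothetical accuracies illustrating the proportion-of-improvement measure.
baseline, simple, sophisticated = 0.50, 0.78, 0.80   # made-up figures
proportion = (simple - baseline) / (sophisticated - baseline)
print(f"The simple method captures {proportion:.0%} of the available improvement")  # 93%
```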

Small improvements are easily swamped by other factors. A fundamental assumption of machine learning is that the training data is representative of the distribution from which future data will be chosen—the assumption is generally that the data is independent and identically distributed (often abbreviated to IID). But in real life, things drift. Yet training data is always retrospective. And it might be quite old. Consider the loan scenario introduced in Section 1.3. To collect a substantial volume of training data (and thorough training needs a substantial volume), we must wait until many loans have been issued. And then we must wait until the end of the loan period (2 years? 5 years?) for the outcome to be known. By the time we use it for training, the data is quite old. And what has changed in the meantime? There are new ways of doing things. The bank has changed the way it defines measurements on which the features are based. New features have become available. Policies have altered. Is that ancient data really representative of today’s problem?

Another fundamental problem is the reliability of the class labels in the training data. There may be small errors—random or even systematic ones—in which case, perhaps, we should stick to simpler models because the higher-order terms of more complex models may be very inaccurate. In determining class labels, someone, somewhere, may be mapping a gray world on to a black-and-white one, which requires judgment and invites inconsistency. And things may change: the notion of a “defaulter” on a loan—say, unpaid bills for 3 months—may be subtly different today than it was before—perhaps, in today’s economic climate, hard-pressed customers will be given another couple of months’ leeway before calling in the bailiffs. The point is not that learning will necessarily fail. The changes may be fairly subtle, and the learned models may still work well. The point is that the extra few percent gained by a sophisticated model over a simple one may be swamped by other factors.

Another issue, when looking at comparative experiments with machine learning methods, is who is doing the driving. It’s not just a matter of firing up the various different methods and recording the results. Many machine learning schemes benefit from tweaking—optimization to fit the problem at hand. Hopefully the data used for tweaking is kept entirely separate from that used for testing (otherwise the results are dishonest). But it is natural that an expert in some particular method—maybe the person who developed it—is able to squeeze more performance out of it than someone else. If they are trying to get their work published, they will certainly want to present the new method in the best possible light. They may not be so experienced at squeezing good performance out of existing, competitive, methods—or so diligent. New methods always look better than old ones; also, more complicated schemes are harder to criticize than simpler ones!

The upshot is that small gains in laboratory performance, even though real, may be swamped by other factors when machine learning is applied to a practical data mining problem. If you want to do something worthwhile on a practical dataset you need to take the entire problem context into account.

Before closing this section, we should note that Hand’s paper was written before the breakthrough results obtained using deep learning. These show that extremely complex models can yield substantial benefits in some applications, if sufficient training data is available and sufficient care is taken when learning them.

13.2 Learning From Massive Datasets

The enormous proliferation of very large databases in today’s companies and scientific institutions makes it necessary for machine learning algorithms to operate on massive datasets. Two separate dimensions become critical when any algorithm is applied to very large datasets: space and time.

Suppose the data is so large that it cannot be held in main memory. This causes no difficulty if the learning scheme works in an incremental fashion, processing one instance at a time when generating the model. An instance can be read from the input file, the model can be updated, the next instance can be read, and so on—without ever holding more than one training instance in main memory. This is “data stream learning,” and we discuss it in the next section. Other methods, such as basic instance-based schemes and locally weighted regression, need access to all the training instances at prediction time. In that case, sophisticated caching and indexing mechanisms have to be employed to keep only the most frequently used parts of a dataset in memory and to provide rapid access to relevant instances in the file.

The other critical dimension when applying learning algorithms to massive datasets is time. If the learning time does not scale linearly (or almost linearly) with the number of training instances, it will eventually become infeasible to process very large datasets. In some applications the number of attributes is a critical factor, and only methods that scale linearly in the number of attributes are acceptable. Alternatively, prediction time might be the crucial issue. Fortunately, there are many learning algorithms that scale gracefully during both training and testing. For example, the training time for Naïve Bayes is linear in both the number of instances and the number of attributes. For top-down decision tree inducers, we saw in Section 6.1 that training time is linear in the number of attributes and, if the tree is uniformly bushy, log-linear in the number of instances (if subtree raising is not used).

When a dataset is too large for a particular learning algorithm to be applied, there are three ways to make learning feasible. The first is trivial: instead of applying the scheme to the full dataset, use just a small subset for training. Of course, information is lost when subsampling is employed. However, the loss may be negligible because the predictive performance of a learned model often flattens out long before all the training data is incorporated into it. If this is the case, it can easily be verified by observing the model’s performance on a holdout test set for training sets of different size.

This kind of behavior, called the law of diminishing returns, may arise because the learning problem is a simple one, so that a small volume of training data is sufficient to learn an accurate model. Alternatively, the learning algorithm might be incapable of grasping the detailed structure of the underlying domain. This is often observed when Naïve Bayes is employed in a complex domain: additional training data may not improve the performance of the model, whereas a decision tree’s accuracy may continue to climb. In this case, of course, if predictive performance is the main objective you should switch to the more complex learning algorithm. But beware of overfitting! Take care not to assess performance on the training data.
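Checking whether performance has flattened out is straightforward. The following sketch trains on progressively larger subsets and reports accuracy on a fixed holdout set; it uses scikit-learn and synthetic data purely for illustration, and any learner and dataset could be substituted:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Synthetic stand-in for a large dataset; in practice X and y come from your own data.
X, y = make_classification(n_samples=200_000, n_features=40, n_informative=10, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# Train on increasingly large subsets and watch whether the holdout accuracy flattens.
for n in (1_000, 5_000, 20_000, 80_000, len(X_train)):
    model = DecisionTreeClassifier(random_state=1).fit(X_train[:n], y_train[:n])
    acc = accuracy_score(y_test, model.predict(X_test))
    print(f"{n:>7} training instances: holdout accuracy {acc:.3f}")
```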

Parallelization is another way of reducing the time complexity of learning. The idea is to split the problem into smaller parts, solve each using a separate processor, and combine the results. To do this, a parallelized version of the learning algorithm must be created. Some algorithms lend themselves naturally to parallelization. Nearest-neighbor methods, e.g., can easily be distributed among several processors by splitting the data into parts and letting each processor find the nearest neighbor in its part of the training set. Decision tree learners can be parallelized by letting each processor build a subtree of the complete tree. Bagging and stacking (although not boosting) are naturally parallel algorithms. However, parallelization is only a partial remedy because with a fixed number of processors, the algorithm’s asymptotic time complexity cannot be improved—although modern graphics cards contain an enormous number of (very simple) processors!

A simple way to apply any algorithm to a large dataset is to split the data into chunks of limited size and learn models separately for each one, combining the result using voting or averaging. Either a parallel bagging-like scheme or a sequential boosting-like scheme can be employed for this purpose. Boosting has the advantage that new chunks can be weighted based on the classifiers learned from previous chunks, thus transferring knowledge between chunks. In both cases memory consumption increases linearly with dataset size; hence some form of pruning is necessary for very large datasets. This can be done by setting aside some validation data and only adding a model from a new chunk to the committee classifier if it increases the committee’s performance on the validation set. The validation set can also be used to identify an appropriate chunk size by running the method with several different chunk sizes in parallel and monitoring performance on the validation set.
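A minimal sketch of the chunk-based, bagging-like approach with validation-based pruning might look as follows (scikit-learn and synthetic data are assumptions made purely for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Split a large dataset into chunks, learn one model per chunk, and keep a
# model only if it improves the committee's accuracy on a validation set.
X, y = make_classification(n_samples=100_000, n_features=20, random_state=1)
X_val, y_val = X[:10_000], y[:10_000]          # validation data set aside
X_rest, y_rest = X[10_000:], y[10_000:]

def committee_predict(committee, X):
    votes = np.array([m.predict(X) for m in committee])
    # Majority vote over the committee's predictions, one column per instance.
    return np.apply_along_axis(lambda col: np.bincount(col).argmax(), 0, votes)

committee, best_acc = [], 0.0
chunk_size = 10_000
for start in range(0, len(X_rest), chunk_size):
    model = DecisionTreeClassifier(random_state=1).fit(
        X_rest[start:start + chunk_size], y_rest[start:start + chunk_size])
    candidate = committee + [model]
    acc = accuracy_score(y_val, committee_predict(candidate, X_val))
    if acc > best_acc:            # keep the new model only if it helps
        committee, best_acc = candidate, acc
print(f"{len(committee)} models kept, validation accuracy {best_acc:.3f}")
```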

The best but most challenging way to enable a learning paradigm to deal with very large datasets would be to develop new algorithms with lower computational complexity. In some cases, it is provably impossible to derive exact algorithms with lower complexity. Decision tree learners that deal with numeric attributes fall into this category. Their asymptotic time complexity is dominated by the sorting process for the numeric attribute values, a procedure that must be performed at least once for any given dataset. However, stochastic algorithms can sometimes be derived that approximate the true solution but require a much smaller amount of time.

Background knowledge can make it possible to vastly reduce the amount of data that needs to be processed by a learning algorithm. Depending on which attribute is the class, most of the attributes in a huge dataset might turn out to be irrelevant when background knowledge is taken into account. As usual, it pays to carefully engineer the data that is passed to the learning scheme and make the greatest possible use of any prior information about the learning problem at hand. If insufficient background knowledge is available, the attribute filtering algorithms described in Section 7.1 can often drastically reduce the amount of data—possibly at the expense of a minor loss in predictive performance. Some of these—e.g., attribute selection using decision trees or the 1R learning scheme—are linear in the number of attributes.

To give a feeling for the volume of data that can be handled by straightforward implementations of machine learning algorithms on ordinary microcomputers, we ran WEKA’s decision tree learner J48, an implementation of C4.5, on a dataset with 5 M instances, 40 attributes (almost all numeric), and a class with 25 values.1 We used a reasonably modern machine2 running a Java virtual machine3 with 6 Gbytes of heap space (half of this was required just to load the data). The resulting tree, which had 1388 nodes, took 18 minutes to build. In general, Java is a little slower than equivalent C/C++ code—but less than twice as slow.

There are datasets today that truly deserve the adjective massive. Scientific datasets from astrophysics, nuclear physics, earth science, and molecular biology are measured in terabytes. So are datasets containing records of financial transactions. Application of standard programs for machine learning to such datasets in their entirety is a very challenging proposition.

13.3 Data Stream Learning

One way of addressing massive datasets is to develop learning algorithms that treat the input as a continuous data stream. In the paradigm of data stream mining, which has developed during the last decade, algorithms are designed to cope naturally with datasets that are many times the size of main memory—perhaps even indefinitely large. The core assumption is that each instance can be inspected at most once and must then be discarded to make room for subsequent instances. The learning algorithm has no control over the order in which instances are processed, and must update its model incrementally as each one arrives. Most models also satisfy the “anytime” property—they are ready to be applied at any point during the learning process. Such algorithms are ideal for real-time learning from data streams, making predictions in real time whilst adapting the model to changes in the evolving input stream. They are typically applied to online learning from data produced by physical sensors.

For such applications, the algorithm must operate indefinitely yet use a limited amount of memory. Even though we have stipulated that instances are discarded as soon as they have been processed, it is obviously necessary to remember at least something about at least some of the instances, otherwise the model would be static. And as time progresses, the model may continue to grow—inexorably. But it must not be allowed to grow without bound. When processing big data, memory is quickly exhausted unless limits are enforced on every aspect of its use. Moving from space to time, algorithms intended for real-time application must process instances faster than they arrive, dealing with each one within a fixed, constant, preferably small, time bound. This does not allow, e.g., for occasional complex reorganizations of a tree model—unless the cost can be amortized over several instances, which introduces a further level of complexity.

Naïve Bayes is a rare example of an algorithm that needs no adaptation to deal with data streams—as long as there are no substantial changes in the stream. Training is incremental: it merely involves updating a fixed set of numeric parameters. Memory usage is small because no structure is added to the model. Other classifiers with the same properties include 1R and the basic perceptron. Multilayer neural nets usually have a fixed structure as well, and as we saw in Section 7.2 stochastic backpropagation updates weights incrementally after each training instance has been processed, rather than in a batch operation, and thus is suitable for online learning. Rules with exceptions make modifications incrementally by expressing exceptions to existing rules rather than reengineering the entire set, and thus could be rendered suitable for data stream learning—although care would need to be taken to ensure that memory usage did not increase inexorably as the number of exceptions increased. Instance-based algorithms and related methods such as locally weighted linear regression are also incremental, but need to be adapted to operate within a fixed memory bound.

To convey the flavor of how a standard algorithm might be adapted for stream processing, we will examine the case of decision trees—which have the advantage of evolving structure in a form that is interpretable. Early work on incremental induction of decision trees devised methods for creating a tree and allowing it to be restructured when sufficient evidence had accumulated that an alternative version would be better. However, a large amount of information needs to be retained to support the restructuring operation—in some cases, all of the training data. Furthermore, restructuring tends to be slow—sometimes slower than recreating the entire tree from scratch. Although interesting, these methods do not support indefinite processing of data streams in real time.

Their problem is that they adopt the usual paradigm of squeezing as much information as possible out of the available instances. With data streams, this is not necessarily appropriate—it is perfectly acceptable to discard some information about the instances, because if it is important it will always reappear. A new paradigm of “Hoeffding trees” was introduced in 2000, which builds models that can be proven equivalent to standard decision trees if the data is static and the number of examples is large enough.

Hoeffding trees are based on a simple idea known as the Hoeffding bound. It makes intuitive sense that, given enough independent observations, the true mean of a random variable will not differ from the estimated mean by more than a certain amount. In fact, the Hoeffding bound states that after n observations and with probability 1−δ, the true mean of a random variable of range R will not be smaller than the estimated mean minus ε, where

ε = sqrt( R² ln(1/δ) / (2n) ).

This bound holds regardless of the probability distribution that underlies the values. Being general, it is more conservative than distribution-dependent bounds. Although tighter bounds are known for particular distributions, the Hoeffding formulation works well empirically.

The basic issue in decision tree induction is to choose an attribute to branch on at each stage. To apply the Hoeffding bound, first set a small value of δ (say 10−7), the probability that the choice of attribute will be incorrect. The random variable being estimated is the difference in information gain between the best two attributes, and R is the base two logarithm of the number of possible class labels. For example, if the difference in gain between the best two attributes is estimated to be 0.3, and the above formula yields a value for ε of 0.1, the bound guarantees that the actual difference in gain exceeds 0.2 with high probability, which represents positive separation for the best attribute. Thus it is safe to split.

If the difference in information gain between the best two attributes is less than ε, it is not safe to split. However, ε will decrease as n continues to increase, so it is simply a matter of waiting until more examples have been seen—although, of course, this may alter the estimate of which are the two best attributes and how far apart they are.

This simple test is the core principle of Hoeffding trees: to decide, with probability 1–δ, that a particular attribute exhibits greater information gain than all the others, i.e., the gap between it and its closest competitor exceeds ε. The bound decays rapidly as more examples are seen: e.g., for a two-class problem (R=1) with δ=10−7 it falls below 0.1 after the first 1000 examples and below 0.01 after the first 100,000. One might object that as the number of leaves grows indefinitely, the probability of making incorrect decisions will continually increase even though the probability of error at each one falls below δ. This is true—except that, working within finite memory, the number of leaves cannot grow indefinitely. Given a maximum tree size, keeping the overall probability of error within a given bound is just a matter of choosing an appropriate value for δ. The basic principle can be applied to measures other than the information gain, and to learning methods other than decision trees.
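The split test itself is tiny. Here is a minimal sketch (not a full Hoeffding tree implementation) that computes ε and decides whether it is safe to split; the figures quoted above can be verified directly:

```python
import math

def hoeffding_bound(R, delta, n):
    """Epsilon after n independent observations of a random variable with
    range R, such that the bound holds with probability 1 - delta."""
    return math.sqrt(R * R * math.log(1.0 / delta) / (2.0 * n))

def safe_to_split(best_gain, second_gain, n, num_classes, delta=1e-7):
    R = math.log2(num_classes)   # range of the information gain
    return (best_gain - second_gain) > hoeffding_bound(R, delta, n)

# For a two-class problem (R = 1) with delta = 1e-7, as quoted in the text:
print(round(hoeffding_bound(1, 1e-7, 1_000), 3))    # ~0.09, below 0.1
print(round(hoeffding_bound(1, 1e-7, 100_000), 4))  # ~0.009, below 0.01
```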

There are many other issues. A tie-breaking strategy is advisable to permit further development of the tree in situations where the top two attributes exhibit very similar information gains. Indeed, the presence of two identical attributes could block any development of the tree at all. To prevent this, nodes should be split whenever the Hoeffding bound falls below a small prespecified tie-breaking parameter, no matter how close the next best option. To increase efficiency the Hoeffding test may be performed periodically for each leaf, after k new instances have reached it, and only when a mix of classes has reached the leaf—otherwise there is no need to split. Prepruning is another simple possibility. The algorithm can incorporate this by also evaluating the merit of not splitting at all: i.e., by splitting only if the best attribute’s information gain at the node exceeds zero. Unlike prepruning in the batch learning setting, this is not a permanent decision: nodes are only prevented from splitting until it appears that a split will be useful.

Now consider memory usage. What must be stored within a leaf are simply counts of the number of times each class label reaches that leaf, for each attribute value. This causes problems for numeric attributes, which require separate treatment. Unsupervised discretization is easy, but supervised prediscretization is trickier. A Gaussian approximation can be made for numeric attributes on a per-class basis and updated using simple incremental update algorithms for mean and variance. To prevent indefinite growth in memory requirements, a strategy must be devised to limit the total number of nodes in the tree. This can be done by deactivating leaves that look insufficiently promising in terms of the accuracy gain that further development might yield. The potential gain is bounded by the expected number of mistakes a leaf might make, so this is an obvious candidate for measuring its promise. Leaves can periodically be ordered from most to least promising and deactivated accordingly. A further possibility for saving space is to abandon attributes that seem to be poor predictors and discard their statistics from the model.
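For numeric attributes, the per-class Gaussian approximation only needs a constant-space incremental update of the mean and variance. A minimal sketch, using Welford’s well-known update rule, is shown below:

```python
class RunningGaussian:
    """Incrementally maintained mean and variance (Welford's update); one
    instance per (attribute, class) pair at a leaf, using constant memory."""
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, x):
        self.n += 1
        delta = x - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)

    @property
    def variance(self):
        return self.m2 / (self.n - 1) if self.n > 1 else 0.0

stats = RunningGaussian()
for value in (2.0, 4.0, 4.0, 5.0):   # values arriving one at a time from the stream
    stats.update(value)
print(stats.mean, stats.variance)    # 3.75 and roughly 1.58
```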

Although this section has focused on decision trees for classification, researchers have studied stream-based versions of all the classical data mining problems: regression, clustering, ensemble methods, association rules, and so on. An open-source system called MOA for Massive Online Analysis, closely related to WEKA, contains a collection of online learning algorithms, as well as tools for evaluation.4

13.4 Incorporating Domain Knowledge

Throughout this book we have emphasized the importance of getting to know your data when undertaking practical data mining. Knowledge of the domain is absolutely essential for success. Data about data is often called metadata, and one of the frontiers in machine learning is the development of ways to allow learning methods to take metadata into account in a useful way.

You don’t have to look far for examples of how metadata might be applied. In Chapter 2, Input: concepts, instances, attributes, we divided attributes into nominal and numeric. But we also noted that many finer distinctions are possible. If an attribute is numeric an ordering is implied, but sometimes there is a zero point and sometimes not (for time intervals there is, but for dates there is not). Even the ordering may be nonstandard: angular degrees have a different ordering to integers because 360° is the same as 0° and 180° is the same as −180° or indeed 900°. Discretization schemes assume ordinary linear ordering, as do learning schemes that accommodate numeric attributes, but it would be a routine matter to extend them to circular orderings. Categorical data may also be ordered. Imagine how much more difficult our lives would be if there were no conventional ordering for letters of the alphabet. (Looking up a listing in the Hong Kong telephone directory presents an interesting and nontrivial problem!) And the rhythms of everyday life are reflected in circular orderings: days of the week, months of the year. To further complicate matters there are many other kinds of ordering, such as partial orderings on subsets: subset A may include subset B, or subset B may include subset A, or neither may include the other. Extending ordinary learning schemes to take account of this kind of information in a satisfactory and general way is an active research area.

Metadata often involves relations among attributes. Three kinds of relation can be distinguished: semantic, causal, and functional. A semantic relation between two attributes indicates that if the first is included in a rule, the second should be, too. In this case, it is known a priori that the attributes only make sense together. For example, in agricultural data that we have analyzed, an attribute called milk production measures how much milk an individual cow produces, and the purpose of our investigation meant that this attribute had a semantic relationship with three other attributes, cow-identifier, herd-identifier, and farmer-identifier. In other words, a milk production value can only be understood in the context of the cow that produced the milk, and the cow is further linked to a specific herd owned by a given farmer. Semantic relations are, of course, problem dependent: they depend not just on the dataset but also on what you are trying to do with it.

Causal relations occur when one attribute causes another. In a system that is trying to predict an attribute caused by another, we know that the other attribute should be included to make the prediction more meaningful. For example, in the agricultural data mentioned previously there is a chain from the farmer, herd, and cow identifiers, through measured attributes such as milk production, down to the attribute that records whether a particular cow was retained or sold by the farmer. Learned rules should recognize this chain of dependence.

Functional dependencies occur in many databases, and the people who create databases strive to identify them for the purpose of normalizing the relations in the database. When learning from the data, the significance of a functional dependency of one attribute on another is that if the latter is used in a rule there is no need to consider the former. Learning schemes often rediscover functional dependencies that are already known. Not only does this generate meaningless, or more accurately tautological, rules, but also other, more interesting patterns may be obscured by the functional relationships. However, there has been much work in automatic database design on the problem of inferring functional dependencies from example queries, and the methods developed should prove useful in weeding out tautological rules generated by learning schemes.

Taking these kinds of metadata, or prior domain knowledge, into account when doing induction using any of the algorithms we have met does not seem to present any deep or difficult technical challenges. The only real problem—and it is a big one—is how to express the metadata in a general and easily understandable way so that it can be generated by a person and used by the algorithm.

It seems attractive to couch the metadata knowledge in just the same representation as the machine learning scheme generates. We focus on rules, which are the norm for much of this work. The rules that specify metadata correspond to prior knowledge of the domain. Given training examples, additional rules can be derived by one of the rule induction schemes we have already met. In this way, the system might be able to combine “experience” (from examples) with “theory” (from domain knowledge). It would be capable of confirming and modifying its programmed-in knowledge based on empirical evidence. Loosely put, the user tells the system what he or she knows, gives it some examples, and it figures the rest out for itself!

To make use of prior knowledge expressed as rules in a sufficiently flexible way, it is necessary for the system to be able to perform logical deduction. Otherwise, the knowledge has to be expressed in precisely the right form for the learning algorithm to take advantage of it, which is likely to be too demanding for practical use. Consider causal metadata: if attribute A causes B and B causes C, we would like the system to deduce that A causes C rather than having to state that fact explicitly. Although in this simple example explicitly stating the new fact presents little problem, in practice, with extensive metadata, it will be unrealistic to expect users to express all logical consequences of their prior knowledge.

A combination of deduction from prespecified domain knowledge and induction from training examples seems like a flexible way of accommodating metadata. At one extreme, when examples are scarce (or nonexistent), deduction is the prime (or only) means of generating new rules. At the other, when examples are abundant but metadata is scarce (or nonexistent), the standard machine learning techniques described in this book suffice. Practical situations span the territory between.

This is a compelling vision, and methods of inductive logic programming, mentioned in Section 3.4, offer a general way of specifying domain knowledge explicitly through statements in a formal logic language. However, current logic programming solutions suffer serious shortcomings in real-world environments. They tend to be brittle and to lack robustness, and they may be so computation intensive as to be infeasible on datasets of any practical size. Perhaps this stems from the fact that they use first-order logic, i.e., they allow variables to be introduced into the rules. The machine learning schemes we have seen, whose input and output are represented in terms of attributes and constant values, perform their machinations in propositional logic, without variables—greatly reducing the search space and avoiding all sorts of difficult problems of circularity and termination.

Some aspire to realize the vision without the accompanying brittleness and computational infeasibility of full logic programming solutions by adopting simplified reasoning systems. Others place their faith in the general mechanism of Bayesian networks, discussed in Section 9.2, in which causal constraints can be expressed in the initial network structure and hidden variables can be postulated and evaluated automatically. Probabilistic logic learning offers a way to cope with both the complexity and the uncertainty of the real world by combining logic programming with statistical reasoning. It will be interesting to see whether systems that allow flexible specification of different types of domain knowledge will become widely deployed.

13.5 Text Mining

Data mining is about looking for patterns in data. Likewise, text mining is about looking for patterns in text: it is the process of analyzing text to extract information that is useful for particular purposes. Compared with the kind of data we have been talking about in this book, text is unstructured, amorphous, and difficult to deal with. Nevertheless, in modern Western culture, text is the most common vehicle for the formal exchange of information. The motivation for trying to extract information from it is compelling—even if success is only partial.

The superficial similarity between text and data mining conceals real differences. In Chapter 1, What’s it all about?, we characterized data mining as the extraction of implicit, previously unknown, and potentially useful information from data. With text mining, however, the information to be extracted is clearly and explicitly stated in the text. It is not hidden at all—most authors go to great pains to make sure that they express themselves clearly and unambiguously. From a human point of view, the only sense in which it is “previously unknown” is that time restrictions make it infeasible for people to read the text themselves. The problem, of course, is that the information is not couched in a manner that is amenable to automatic processing. Text mining strives to bring it out in a form that is suitable for consumption by computers or by people who do not have time to read the full text.

Both data and text mining seek to extract information that is potentially useful. In one sense, this means actionable—capable of providing a basis for actions to be taken automatically. In the case of data mining, this notion can be expressed in a relatively domain-independent way: actionable patterns are ones that allow nontrivial predictions to be made on new data from the same source. Performance can be measured by counting successes and failures, statistical techniques can be applied to compare different data mining methods on the same problem, and so on. However, in many text mining situations it is hard to characterize what “actionable” means in a way that is independent of the particular domain at hand. This makes it difficult to find fair and objective measures of success.

As we have emphasized throughout this book, “potentially useful” is often given another interpretation in practical data mining: the key for success is that the information extracted must be comprehensible in that it helps to explain the data. This is necessary whenever the result is intended for human consumption rather than (or as well as) a basis for automatic action. This criterion is less applicable to text mining because, unlike data mining, the input itself is comprehensible. Text mining with comprehensible output is tantamount to summarizing salient features from a large body of text, which is a subfield in its own right: text summarization.

Document Classification and Clustering

We have already encountered one important text mining problem: document classification, in which each instance represents a document and the instance’s class is the document’s topic. Documents are characterized by the words that appear in them. The presence or absence of each word can be treated as a Boolean attribute, or documents can be treated as bags of words, rather than sets, by taking word frequencies into account. We encountered this distinction in Section 4.2, where we learned how to extend Naïve Bayes to the bag-of-words representation, yielding the multinomial version of the algorithm.

There is, of course, an immense number of different words, and most of them are not very useful for document classification. This presents a classic feature selection problem. Some words—e.g., function words, often called stopwords—can usually be eliminated a priori, but although these occur very frequently there are not all that many of them. Other words occur so rarely that they are unlikely to be useful for classification. Paradoxically, infrequent words are common—nearly half the words in a document or corpus of documents occur just once. Nevertheless, such an overwhelming number of words remains after these word classes are removed that further feature selection may be necessary using the methods described in Section 8.1. Another issue is that the bag-of-words (or set-of-words) model neglects word order and contextual effects. There is a strong case for detecting common phrases and treating them as single units.
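As a concrete illustration, here is a minimal document classification pipeline using the bag-of-words representation, stopword removal, and multinomial Naïve Bayes. It uses scikit-learn and a made-up four-document corpus purely for illustration; real collections are, of course, far larger:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny illustrative corpus with made-up labels.
docs = ["the market rallied on strong earnings",
        "the team won the championship game",
        "shares fell after the earnings report",
        "the striker scored twice in the game"]
labels = ["finance", "sport", "finance", "sport"]

# Bag-of-words with English stopwords removed; word order is discarded.
vectorizer = CountVectorizer(stop_words="english")
X = vectorizer.fit_transform(docs)

model = MultinomialNB().fit(X, labels)
print(model.predict(vectorizer.transform(["earnings fell sharply"])))  # expected: ['finance']
```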

Document classification is supervised learning: the categories are known beforehand and given in advance for each training document. The unsupervised version of the problem is called document clustering. Here there is no predefined class, but groups of cognate documents are sought. Document clustering can assist information retrieval by creating links between similar documents, which in turn allows related documents to be retrieved once one of the documents has been deemed relevant to a query.

There are many applications of document classification. A relatively easy categorization task, language identification, provides an important piece of metadata for documents in international collections. A simple representation that works well for language identification is to characterize each document by a profile that consists of the n-grams, or sequences of n consecutive letters (for some small value such as n=3), that appear in it. The most frequent 300 or so n-grams are highly correlated with the language. A more challenging application is authorship ascription, where a document’s author is uncertain and must be guessed from the text. Here, the stopwords, not the content words, are the giveaway, because their distribution is author dependent but topic independent. A third problem is the assignment of key phrases to documents from a controlled vocabulary of possible phrases, given a large number of training documents that are tagged from this vocabulary.
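To give the flavor of the n-gram profile technique for language identification mentioned above, here is a toy sketch; the training snippets are hypothetical and far too small for real use, since genuine profiles are built from substantial corpora:

```python
from collections import Counter

def ngram_profile(text, n=3, top=300):
    """Most frequent character n-grams: the 'profile' used to identify a language."""
    text = " " + " ".join(text.lower().split()) + " "
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [g for g, _ in grams.most_common(top)]

def profile_distance(p1, p2):
    """Out-of-place distance: how far each n-gram's rank differs between profiles."""
    return sum(abs(i - p2.index(g)) if g in p2 else len(p2) for i, g in enumerate(p1))

# Hypothetical training texts, one per language.
profiles = {lang: ngram_profile(text) for lang, text in {
    "english": "the quick brown fox jumps over the lazy dog and then runs away",
    "german":  "der schnelle braune fuchs springt ueber den faulen hund und rennt weg",
}.items()}

unknown = ngram_profile("the dog runs over the fox")
print(min(profiles, key=lambda lang: profile_distance(unknown, profiles[lang])))  # english
```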

Information Extraction

Another general class of text mining problems is metadata extraction. Metadata was mentioned above as data about data: in the realm of text the term generally refers to salient features of a work, such as its author, title, subject classification, subject headings, and keywords. Metadata is a kind of highly structured (and therefore actionable) document summary. The idea of metadata is often expanded to encompass words or phrases that stand for objects or “entities” in the world, leading to the notion of entity extraction. Ordinary documents are full of such terms: phone numbers, fax numbers, street addresses, email addresses, email signatures, abstracts, tables of contents, lists of references, tables, figures, captions, meeting announcements, Web addresses, and more. In addition, there are countless domain-specific entities, such as international standard book numbers (ISBNs), stock symbols, chemical structures, and mathematical equations. These terms act as single vocabulary items, and many document processing tasks can be significantly improved if they are identified as such. They can aid searching, interlinking and cross-referencing between documents.

How can textual entities be identified? Rote learning, i.e., dictionary lookup, is one idea, particularly when coupled with existing resources—lists of personal names and organizations, information about locations from gazetteers, or abbreviation and acronym dictionaries. Another is to use capitalization and punctuation patterns for names and acronyms; titles (Ms.), suffixes (Jr.), and baronial prefixes (von); or unusual language statistics for foreign names. Regular expressions suffice for artificial constructs such as uniform resource locators (URLs); explicit grammars can be written to recognize dates and sums of money. Even the simplest task opens up opportunities for learning to cope with the huge variation that real-life documents present. As just one example, what could be simpler than looking up a name in a table? But the name of the former Libyan leader Muammar Qaddafi is represented in 47 different ways in documents that have been received by the Library of Congress!

Many short documents describe a particular kind of object or event, combining entities into a higher-level composite that represents the document’s entire content. The task of identifying the composite structure, which can often be represented as a template with slots that are filled by individual pieces of structured information, is called information extraction. Once the entities have been found, the text is parsed to determine relationships among them. Typical extraction problems require finding the predicate structure of a small set of predetermined propositions. These are usually simple enough to be captured by shallow parsing techniques such as small finite-state grammars, although matters may be complicated by ambiguous pronoun references and attached prepositional phrases and other modifiers. Machine learning has been applied to information extraction by seeking rules that extract fillers for slots in the template. These rules may be couched in pattern-action form, the patterns expressing constraints on the slot-filler and words in its local context. These constraints may involve the words themselves, their part-of-speech tags, and their semantic classes.

Taking information extraction a step further, the extracted information can be used in a subsequent step to learn rules—not rules about how to extract information but rules that characterize the content of the text itself. These rules might predict the values for certain slot-fillers from the rest of the text. In certain tightly constrained situations, such as Internet job postings for computing-related jobs, information extraction based on a few manually constructed training examples can compete with an entire manually constructed database in terms of the quality of the rules inferred.

There is no real consensus about what text mining covers: broadly interpreted, all natural language processing comes under the ambit of text mining. Since their introduction, conditional random fields have been and remain one of the dominant tools in this area. The problem of extracting meeting information from unstructured emails, mentioned in Chapter 9, Probabilistic methods, is just one example: many other information extraction tasks have similar conditional random field formulations.

Natural Language Processing

Natural language processing, a rich field of study with a long history, is an active application area for deep learning. We have seen how latent semantic analysis (LSA) and latent Dirichlet allocation (LDA) permit exploratory topic analysis of document collections. More recently, it has been observed that neural language modeling techniques can perform significantly better than LSA for preserving relationships among words; moreover, LDA also has difficulty scaling to truly massive data.

Researchers at Google have created a set of language models called word2vec, based on single-hidden-layer networks trained with vast amounts of data—783 million words for initial experiments, and 30 billion for later ones. (Associated software is available online.) One such model produces continuous representations of words by training a neural bag-of-words model to predict words given their context. Because word order in the context window is not captured, this is known as a “continuous bag of words” model. Another model, “skip-gram,” feeds each word into a log-linear classifier with a linear projection layer—a form of shallow neural network—that predicts nearby words within a certain distance before and after the source word. Here the number of states for the output prediction equals the vocabulary size, and with data at this scale the vocabulary ranges from 10^5 to 10^9 terms, so the output is decomposed into a binary tree known as a “hierarchical softmax,” which, for a V-word vocabulary, needs to evaluate only log2(V) rather than V output nodes.

A particularly noteworthy aspect of this work is that the representations learned yield projections for words that allow inferences about their meaning to be performed with vector operations. For example, upon projecting the words Paris, France, Italy and Rome into the learned representation, one finds that simple vector subtraction and addition yield the relationship Paris − France + Italy ≈ Rome! More precisely, Rome turns out to be the word closest to the point obtained by computing Paris − France + Italy in this representation.
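This analogy test is easy to reproduce with pretrained vectors. The sketch below assumes the gensim library and a pretrained word2vec model file (the file name is hypothetical, and parameter names may differ slightly between gensim versions):

```python
from gensim.models import KeyedVectors

# Load pretrained word2vec vectors (hypothetical file; e.g., the published
# Google News vectors are distributed in this binary format).
vectors = KeyedVectors.load_word2vec_format("word2vec_vectors.bin", binary=True)

# Paris - France + Italy should land closest to Rome in the learned space.
print(vectors.most_similar(positive=["Paris", "Italy"], negative=["France"], topn=1))
```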

Many research and development groups are mining massive quantities of text data in order to learn as much as possible from scratch, replacing features that have previously been hand-engineered by ones that are learned automatically. Large neural networks are being applied to tasks ranging from sentiment classification and translation to dialog and question answering. The deep encoder-decoder architecture discussed at the end of Chapter 10, Deep learning, represents one such example: Google researchers have used it to learn how to translate languages from scratch, based on voluminous data.

13.6 Web Mining

The World Wide Web is a massive repository of text. Almost all of it differs from ordinary “plain” text because it contains explicit structural markup. Some markup is internal and indicates document structure or format; other markup is external and defines explicit hypertext links between documents. Both these information sources give additional leverage for mining Web documents. Web mining is like text mining but takes advantage of this extra information and often improves results by capitalizing on the existence of topic directories and other information on the Web.

Consider internal markup. Internet resources that contain relational data—telephone directories, product catalogs, and so on—use hypertext markup language (HTML) formatting commands to clearly present the information they contain to Web users. However, it is quite difficult to extract data from such resources in an automatic way. To do so, software systems use simple parsing modules called wrappers to analyze the page structure and extract the requisite information. If wrappers are coded by hand, which they often are, this is a trivial kind of web mining because it relies on the pages having a fixed, predetermined structure from which information can be extracted algorithmically. But pages rarely obey the rules. Their structures vary; Web sites evolve. Errors that are insignificant to human readers throw automatic extraction procedures completely awry. When change occurs, adjusting a wrapper manually can be a nightmare that involves getting your head around the existing code and patching it up in a way that does not cause breakage elsewhere.

Wrapper Induction

Enter wrapper induction—learning wrappers automatically from examples. The input is a training set of pages along with tuples representing the information derived from each page. The output is a set of rules that extracts the tuples by parsing the page. For example, the rules might look for certain HTML delimiters—paragraph boundaries (<p>), list entries (<li>), or boldface (<b>)—that the Web page designer has used to set off key items of information, and learn the sequence in which entities are presented. This could be accomplished by iterating over all choices of delimiters, stopping when a consistent wrapper is encountered. Then recognition will depend only on a minimal set of cues, providing some defense against extraneous text and markers in the input. Alternatively, one might follow Epicurus’s advice at the end of Section 5.10 and seek a robust wrapper that uses multiple cues to guard against accidental variation. The great advantage of automatic wrapper induction is that when errors are caused by stylistic variants it is a simple matter to add these to the training data and reinduce a new wrapper that takes them into account. Wrapper induction reduces recognition problems when small changes occur and makes it far easier to produce new sets of extraction rules when structures change radically.
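The following toy sketch conveys the idea of delimiter-based wrapper induction for a single slot: it keeps only the left and right context that all training pages agree on, then uses those delimiters to extract the field from new pages. The pages and field values are hypothetical, and a real system would handle multiple slots and imperfect matches:

```python
import re

def common_suffix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[-1 - i] == b[-1 - i]:
        i += 1
    return a[len(a) - i:] if i else ""

def common_prefix(a, b):
    i = 0
    while i < min(len(a), len(b)) and a[i] == b[i]:
        i += 1
    return a[:i]

def induce_wrapper(pages, values):
    """Learn the longest left/right delimiters that bracket the target value
    on every training page (a toy, single-slot wrapper)."""
    lefts = [p[:p.index(v)] for p, v in zip(pages, values)]
    rights = [p[p.index(v) + len(v):] for p, v in zip(pages, values)]
    left, right = lefts[0], rights[0]
    for l, r in zip(lefts[1:], rights[1:]):
        left, right = common_suffix(left, l), common_prefix(right, r)
    return left, right

def apply_wrapper(page, left, right):
    return re.search(re.escape(left) + "(.*?)" + re.escape(right), page).group(1)

# Hypothetical training pages in which the price is set off in boldface.
pages = ["<li>Widget <b>$9.99</b></li>", "<li>Gizmo <b>$24.50</b></li>"]
left, right = induce_wrapper(pages, ["$9.99", "$24.50"])
print(repr(left), repr(right))                                      # ' <b>' '</b></li>'
print(apply_wrapper("<li>Gadget <b>$3.25</b></li>", left, right))   # $3.25
```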

PageRank

One of the problems with the Web is that a lot of it is rubbish. In order to separate the wheat from the chaff, a metric called PageRank was introduced by the founders of Google; it is used in various guises by other search engines too, and in many other Web mining applications. It attempts to measure the prestige of a Web page or site, where prestige is, according to a dictionary definition, “high standing achieved through success or influence.” The hope is that this is a good way to determine authority, defined as “an accepted source of expert information or advice.” Recall that the PageRank algorithm was identified in Table 13.1 as one of the top 10 data mining algorithms, the only one that we have not encountered so far. It is perhaps questionable whether it should be classed as a data mining algorithm, but it is worth describing all the same.

The key is external markup in the form of hyperlinks. In a networked community, people reward success with links. If you link to my page, it’s probably because you find it useful and informative—it’s a successful Web page. If a host of people link to it, that indicates prestige: my page is successful and influential. Look at Fig. 13.1, which shows a tiny fraction of the Web, including links between pages. Which ones do you think are most authoritative? Page F has five incoming links, which indicates that five people found it worth linking to, so there’s a good chance that this page is more authoritative than the others. B is second best, with four links.

Figure 13.1 A tangled web.

Merely counting links is a crude measure. Some Web pages have thousands of outgoing links whereas others have just one or two. Rarer links are more discriminating and should count more than others. A link from your page to mine bestows more prestige if your page has few outlinks. In Fig. 13.1 the many links emanating from page A mean that each one carries less weight, simply because A is a prolific linker. From F’s point of view, the links from D and E may be more valuable than the one from A. There is another factor: a link is more valuable if it comes from a prestigious page. The link from B to F may be better than the others into F because B is more prestigious. Admittedly this factor involves a certain circularity, and without further analysis it’s not clear that it can be made to work. But indeed it can.

Here are the details. We define the PageRank of a page to be a number between 0 and 1 that measures its prestige. Each link into the page contributes to its PageRank. The amount it contributes is the PageRank of the linking page divided by the number of outlinks from it. The PageRank of any page is calculated by summing that quantity over all links into it. The value for D in Fig. 13.1 is calculated by adding one-fifth of the value for A (because it has five outlinks) to one-half the value for C.

A simple iterative method is used to resolve the apparently circular nature of the calculation. Start by randomly assigning an initial value to each page. Then recompute each page’s PageRank by summing the appropriate quantities, described earlier, over its inlinks. If the initial values are thought of as an approximation to the true value of PageRank, the new values are a better approximation. Keep going, generating a third approximation, and a fourth, and so on. At each stage, recompute the PageRank for every page in the Web. Stop when, for every page, the next iteration turns out to give almost exactly the same PageRank as the previous one.

Subject to two modifications discussed below, this iteration is guaranteed to converge, and fairly quickly too. Although the precise details are shrouded in secrecy, today’s search engines probably seek an accuracy for the final values of between 10−9 and 10−12. An early experiment reported 50 iterations for a much smaller version of the Web than today’s, before the details became commercial; several times as many iterations are needed now. Google is thought to run programs for several days to perform the PageRank calculation for the entire Web, and the operation is—or at any rate used to be—performed every few weeks.

There are two problems with the calculation we have described. You probably have a mental picture of PageRank flowing through the tangled web of Fig. 13.1, coming into a page through its inlinks and leaving it through its outlinks. What if there are no inlinks (page H)? Or no outlinks (page G)?

To operationalize this picture, imagine a surfer who clicks links at random. Starting from the current page, they choose an outlink at random and go to that link's target page. The probability of taking any particular link is smaller if there are many outlinks, which is exactly the behavior we want from PageRank. It turns out that the PageRank of a given page is proportional to the probability that the random surfer lands on that page.

Now the problem raised by a page with no outlinks becomes apparent: it’s a PageRank sink because when surfers come in they cannot get out. More generally, a set of pages might link to each other but not to anywhere else. This incestuous group is also a PageRank sink: the random surfer gets stuck in a trap. And a page with no inlinks? Random surfers never reach it. In fact, they never reach any group of pages that has no inlinks from the rest of the Web, even though it may have internal links, and outlinks to the Web at large.

These two problems mean that the iterative calculation described above does not converge, as we earlier claimed it would. But the solution is simple: teleportation. With a certain small probability, just make the surfer arrive at a randomly chosen page instead of following a link from the one they are on. That solves both problems. If surfers are stuck at G they will eventually teleport out of it. And if they can’t reach H by surfing, they will eventually teleport into it.

The teleport probability has a strong influence on the rate of convergence of the iterative algorithm—and on the accuracy of its results. At the extreme, if it were equal to 1, meaning that the surfer always teleported, the link structure would have no effect on PageRank, and no iteration would be necessary. If it were 0 and the surfer never teleported, the calculation would not converge at all. Early published experiments used a teleportation probability of 0.15; some speculate that search engines increase it a little to hasten convergence.
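To make the procedure concrete, here is a minimal sketch of the iterative calculation in Python, with teleportation included. The toy graph at the end, the teleport probability of 0.15, the convergence tolerance, and the treatment of pages without outlinks (their rank is spread evenly over the whole Web) are illustrative assumptions, not details of any real search engine.

def pagerank(links, teleport=0.15, tol=1e-9, max_iter=1000):
    """Iterative PageRank with teleportation.
    links maps each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}              # initial approximation
    for _ in range(max_iter):
        new = {p: teleport / n for p in pages}      # share received by teleportation
        for p in pages:
            targets = links.get(p, [])
            if targets:                             # distribute rank over outlinks
                share = (1 - teleport) * rank[p] / len(targets)
                for t in targets:
                    new[t] += share
            else:                                   # sink page: spread its rank evenly
                for t in pages:
                    new[t] += (1 - teleport) * rank[p] / n
        if max(abs(new[p] - rank[p]) for p in pages) < tol:
            return new                              # successive approximations agree
        rank = new
    return rank

# A toy web loosely in the spirit of Fig. 13.1 (purely illustrative)
web = {"A": ["B", "C", "D", "E", "F"], "B": ["F"], "C": ["D"],
       "D": ["F"], "E": ["F"], "F": ["B"], "G": [], "H": ["A"]}
print(pagerank(web))

Real implementations store the link structure as a sparse matrix over billions of pages; only the shape of the iteration matters here.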

Instead of teleporting to a randomly chosen page, you could choose a predetermined probability for each page, and—once you had decided to teleport—use that probability to determine where to land. This does not affect the mechanics of the calculation, but it does affect the result. If a page was discriminated against by receiving a smaller probability than the others, it would end up with a smaller PageRank than it deserves. This gives search engine operators an opportunity to influence the results of the calculation—an opportunity that they probably use to discriminate against certain sites (e.g., ones they believe are trying to gain an unfair advantage by exploiting the PageRank system). This is the stuff of which lawsuits are made.

13.7 Images and Speech

Until recently, images and speech signals received comparatively little attention from data mining researchers. However, the deep learning renaissance has changed all that. Signal processing in general is an area where massive amounts of data are easily available, and it seems well suited to the kind of automatic extraction of low-level features, built up into successively higher-level representations, that deep networks do so well. The world abounds with signal data—but it is generally unlabeled. The production of large collections of labeled data is stimulating practitioners to apply deep learning techniques to many signal processing tasks, principally image recognition and face verification and recognition.

Images

Chapter 10, Deep learning, discussed how deep learning techniques have revolutionized aspects of the field of computer vision. This impact has only been possible because academic groups and technology companies made substantial investments in mining and labeling data on a large scale. The availability of carefully curated datasets enables high-capacity supervised or discriminative deep learning techniques to push recognition performance into the realm of usability for many applications. In addition to the examples highlighted below, convolutional neural network architectures have been used on tasks ranging from segmenting brain tumors to computing the depth in a scene from stereo cameras; the Further reading section gives pointers to specific examples.

Deep convolutional neural network techniques have transformed the field of object category recognition. On large-scale visual recognition challenges such as ImageNet they can outperform people in terms of agreement with human judgments. ImageNet imagery was mined from the Internet and labeled in a large-scale academic project. Google has prepared a dataset of digits obtained by mining and labeling house numbers in their massive collection of Street View images; again, deep convolutional networks outperform people in terms of agreement with human judgments.

There is an important distinction between recognizing object categories and identifying particular objects: identifying your car in a photograph is quite different from recognizing that a car is present. Here also deep learning-based methods are continually improving recognition performance. An alternative approach uses “SIFT descriptors,” based on the scale-invariant feature transform; these are already deployed in systems that recognize specific objects. The descriptors are computed from “interest points” found in an image—things like corners, which are easy to find and localize in images of the same object from different viewpoints. Once these points have been found, a descriptor inspired by the human visual system is computed, based on a trick that helps select an appropriate scale and orientation. This involves cutting out a patch at that scale and orientation, and then creating a histogram of image gradients over the patch. While SIFT descriptors find their greatest application in 3D reconstruction, panoramic image creation, and robotic mapping, they are also widely used for identifying and tracking specific instances of objects.
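For readers who want to experiment, the OpenCV library ships an implementation of SIFT. The sketch below—the file names and the ratio-test threshold are placeholders of ours—detects interest points in two views of the same object and matches their descriptors.

import cv2

# Two views of the same object, in grayscale (file names are placeholders)
img1 = cv2.imread("object_view1.jpg", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("object_view2.jpg", cv2.IMREAD_GRAYSCALE)

sift = cv2.SIFT_create()                        # interest point detector + descriptor
kp1, des1 = sift.detectAndCompute(img1, None)   # keypoints and 128-dimensional descriptors
kp2, des2 = sift.detectAndCompute(img2, None)

# Match descriptors and keep those that pass Lowe's ratio test
matcher = cv2.BFMatcher(cv2.NORM_L2)
matches = matcher.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.75 * n.distance]
print(len(good), "putative correspondences")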

Face recognition, an important special case of object recognition, has been the subject of intense research for decades—and deep convolutional networks have transformed the field. Better-than-human performance has been achieved on the task of face verification, using databases such as the “Labeled faces in the wild” data collected by the University of Massachusetts. In face verification one is given two photographs of faces and asked whether they belong to the same person or not. Good results are achieved using a special “Siamese network” architecture that takes two images as input and compares them using the internal representation of the convolutional neural network. The images are generally preprocessed to center the faces, crop them, and register them on a common coordinate system. Facial keypoint or landmark detection methods are used to help with the registration and warping process, or to localize key zones to be used as input.
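The following sketch conveys the flavor of a Siamese architecture in Keras. It is emphatically not the DeepFace network: the image size, layer sizes, and the choice of squared distance followed by a logistic output are assumptions made purely for illustration.

import tensorflow as tf
from tensorflow.keras import layers, Model

def embedding_net(shape=(96, 96, 1)):
    """Small convolutional tower mapping a face image to an embedding vector.
    Sizes are illustrative, not those of any published system."""
    inp = layers.Input(shape=shape)
    x = layers.Conv2D(32, 3, activation="relu")(inp)
    x = layers.MaxPooling2D()(x)
    x = layers.Conv2D(64, 3, activation="relu")(x)
    x = layers.MaxPooling2D()(x)
    x = layers.Flatten()(x)
    return Model(inp, layers.Dense(128)(x))

tower = embedding_net()                          # one set of weights...
img_a = layers.Input(shape=(96, 96, 1))
img_b = layers.Input(shape=(96, 96, 1))
emb_a, emb_b = tower(img_a), tower(img_b)        # ...applied to both input images

# Squared distance between the two embeddings, then "same person?" as a probability
dist = layers.Lambda(
    lambda t: tf.reduce_sum(tf.square(t[0] - t[1]), axis=1, keepdims=True))([emb_a, emb_b])
same = layers.Dense(1, activation="sigmoid")(dist)
siamese = Model([img_a, img_b], same)
siamese.compile(optimizer="adam", loss="binary_crossentropy")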

It is worth noting, given the discussion of data mining and ethics in Chapter 1, What’s it all about?, that face verification and face recognition raise difficult ethical issues. Federal governments deploy the technology in the fight against international terrorism; airports use it to reduce lineups at immigration. Its potential application in widespread video surveillance has a profound effect on the balance between security and privacy, and other civil liberties. At the individual level, stalkers would welcome end-user web services for face recognition.

Speech

Speech recognition is rapidly becoming a widely used technology. Major companies use large datasets to render their systems robust to different speakers and noise sources. The classical architecture uses a signal processing front end to compute features inspired by the human auditory system from a spectral analysis of the audio input, and passes them to a large system of hidden Markov models that have Gaussian mixture models for their observation likelihoods. Sophisticated language models are used to help disambiguate what words are present in the audio. Here again deep learning methods have been having substantial impact, and many large industrial groups are replacing elements of the classic speech recognition pipeline with deep learning techniques such as recurrent neural networks.

13.8 Adversarial Situations

A prime application of machine learning is junk email filtering. When we wrote the second edition of this book (late 2004) the scourge of unwanted email was a burning issue; by the writing of the third edition of this book (early 2010), the problem had abated despite the continual growth of spam email (by some estimates it accounts for 95% of all emails). This is largely due to the widespread use of spam filtering, which often uses learning techniques. At first blush junk email filtering appears to present a standard problem of document classification: divide documents into “ham” and “spam” on the basis of the text they contain, guided by training data, of which there are copious amounts. But it differs from ordinary document classification because it involves an adversarial aspect. The documents that are being classified are not chosen at random from an unimaginably huge set of all possible documents; they include emails that are carefully crafted to evade the filtering process, designed specifically to beat the system.

Early spam filters simply discarded messages containing “spammy” words that connote such things as sex, lucre, and quackery. Of course, much legitimate correspondence concerns gender, money, and medicine: a balance must be struck. So filter designers recruited Bayesian text classification schemes that learned to strike an appropriate balance during the training process. Spammers quickly adjusted with techniques that concealed the spammy words by misspelling them; overwhelmed them with legitimate text, perhaps printed in white on a white background so that only the filter saw it; or simply put the spam text elsewhere, in an image or a URL that most mail readers download automatically.
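A Bayesian text classifier of the kind those early filter designers recruited can be sketched in a few lines with scikit-learn. The four messages below are invented purely for illustration; a real filter would be trained on many thousands of labeled emails.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Tiny invented corpus, for illustration only
emails = ["cheap meds buy now", "limited offer win money fast",
          "meeting agenda attached", "can we move lunch to tomorrow?"]
labels = ["spam", "spam", "ham", "ham"]

# Bag-of-words features feeding a multinomial Naive Bayes classifier
spam_filter = make_pipeline(CountVectorizer(), MultinomialNB())
spam_filter.fit(emails, labels)
print(spam_filter.predict(["win cheap money now"]))   # -> ['spam']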

The problem is complicated by the fact that it is hard to compare spam detection algorithms objectively. Although training data abounds, privacy issues preclude publishing large public corpora of representative email. And there are strong temporal effects. Spam changes character rapidly, invalidating sensitive statistical tests such as cross-validation. Finally, the bad guys can also use machine learning. For example, if they could get hold of examples of what your filter blocks and what it lets through, they could use this as training data to learn how to evade it.

There are, unfortunately, many other examples of adversarial learning situations in our world today. Closely related to junk email is search engine spam: sites that attempt to deceive Internet search engines into placing them prominently in lists of search results. Highly ranked pages yield direct financial benefits to their owners because they present opportunities for advertising, providing strong motivation for profit seekers. Then there are the computer virus wars, in which designers of viruses and virus-protection software react to one another’s innovations. Here the motivation tends to be general disruption and denial of service rather than monetary gain.

Computer network security is a continually escalating battle. Protectors harden networks, operating systems, and applications, and attackers find vulnerabilities in all three areas. Intrusion detection systems sniff out unusual patterns of activity that might be caused by a hacker’s reconnaissance activity. Attackers realize this and try to obfuscate their trails, perhaps by working indirectly or by spreading their activities over a long time—or, conversely, by striking very quickly. Machine learning is being applied to this problem in an attempt to discover semantic connections among attacker traces in computer network data that intrusion detection systems miss. This is a large-scale problem: audit logs used to monitor computer network security can amount to gigabytes a day even in medium-sized organizations.

Many automated threat detection systems are based on matching current data to known attack types. The U.S. Federal Aviation Administration developed the Computer Assisted Passenger Pre-Screening System (CAPPS), which screens airline passengers on the basis of their flight records and flags individuals for additional checked baggage screening. Although the exact details are unpublished, CAPPS is thought, e.g., to assign higher threat scores to cash payments. However, this approach can only spot known or anticipated threats. Researchers are using unsupervised approaches such as anomaly and outlier detection in an attempt to detect suspicious activity. As well as flagging potential threats, anomaly detection systems can be applied to the detection of illegal activities such as financial fraud and money laundering.

Data mining is being used today to sift through huge volumes of data in the name of homeland defense. Heterogeneous information such as financial transactions, health-care records, and network traffic is being mined to create profiles, construct social network models, and detect terrorist communications. This activity raises serious privacy concerns and has resulted in the development of privacy-preserving data mining techniques. These algorithms try to discern patterns in the data without accessing the original data directly, typically by distorting it with random values. To preserve privacy, they must guarantee that the mining process does not receive enough information to reconstruct the original data. This is easier said than done.

On a lighter note, not all adversarial data mining is aimed at combating nefarious activity. Multiagent systems in complex, noisy real-time domains involve autonomous agents that must both collaborate in a team and compete against antagonists. If you are having trouble visualizing this, think soccer. Robo-soccer is a rich and popular domain for exploring how machine learning can be applied to such difficult problems—primarily using reinforcement learning techniques, which are beyond the scope of this book. Players must not only hone low-level skills but must also learn to work together and adapt to the behavior patterns of different opponents.

Finally, machine learning has been used to solve an actual historical literary mystery by unmasking a prolific author who had attempted to conceal his identity. Ben Ish Chai was the leading rabbinic scholar in Baghdad in the late 19th century. Among his vast literary legacy are two separate collections of about 500 Hebrew-Aramaic letters written in response to legal queries. He is known to have written one collection. Although he claims to have found the other in an archive, historians suspect that he wrote it, too, but attempted to disguise his authorship by deliberately altering his style. The problem this case presents to machine learning is that there is no corpus of work to ascribe to the mystery author. There were a few known candidates, but the letters could equally well have been written by anyone else. A new technique appropriately called unmasking was developed that creates a model to distinguish the known author’s work A from the unknown author’s work X, iteratively removes those features that are most useful for distinguishing the two, and examines the speed with which cross-validation accuracy degrades as more features are removed. The hypothesis is that if work X is written by work A’s author, who is trying to conceal his identity, whatever differences there are between work X and work A will be reflected in only a relatively small number of features compared with the differences between work X and the works of a completely different author, say the author of work B. In other words, when work X is compared with works A and B, the accuracy curve as features are removed will decline much faster for work A than it does for work B. It was concluded that Ben Ish Chai did indeed write the mystery letters. This technique is a striking example of original and creative use of machine learning in an adversarial situation.
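The shape of the unmasking procedure can be sketched as follows. The classifier, the way the collections are chopped into text chunks, and the parameter values are our assumptions for illustration; the details of the original study differ.

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

def unmasking_curve(chunks_a, chunks_x, n_rounds=10, k_remove=6):
    """Cross-validation accuracy for distinguishing collection A from collection X
    as the most discriminating features are repeatedly removed.
    Parameter values are illustrative assumptions."""
    y = np.array([0] * len(chunks_a) + [1] * len(chunks_x))
    X = CountVectorizer(max_features=250).fit_transform(chunks_a + chunks_x).toarray()
    curve = []
    for _ in range(n_rounds):
        clf = LinearSVC(max_iter=10000)
        curve.append(cross_val_score(clf, X, y, cv=5).mean())
        clf.fit(X, y)
        # remove the features carrying the largest absolute weights
        strongest = np.argsort(-np.abs(clf.coef_[0]))[:k_remove]
        X = np.delete(X, strongest, axis=1)
    return curve   # a curve that collapses quickly suggests A and X share an author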

13.9 Ubiquitous Data Mining

We began this book by pointing out that we are overwhelmed with data. Nowhere does this impact the lives of ordinary people more than on the World Wide Web. No-one can keep pace with the information explosion. Whereas data mining originated in the corporate world because that’s where the databases are, text and web mining are moving machine learning technology out of the companies and into the home. Whenever we are overwhelmed by data on the Web, mining techniques promise tools to tame it. Applications are legion. Finding friends and contacting them, maintaining financial portfolios, shopping for bargains in an electronic world, using data detectors of any kind—all of these could be accomplished automatically without explicit programming. Already mining techniques are being used to predict what link you’re going to click next, to organize documents for you, sort your mail, and prioritize your search results. In a world where information is overwhelming, disorganized and anarchic, text and web mining may be the solution we so desperately need.

Many believe that the Web is but the harbinger of an even greater paradigm shift: ubiquitous computing. Small portable devices are everywhere—mobile phones, personal digital assistants, personal stereo and video players, digital cameras, mobile Web access. Already some devices integrate all these functions. They know our location in physical time and space, help us communicate in social space, organize our personal planning space, recall our past, and envelop us in global information space. It is easy to find scores of processors in a middle-class home in the US today, and they often communicate with one another and with the global information infrastructure. Thus the potential for data mining will soar.

Take consumer music. Popular music leads the vanguard of technological advance. Sony’s original Walkman paved the way to today’s ubiquitous portable electronics. Apple’s iPod pioneered large-scale portable storage. Napster’s network technology spurred the development of peer-to-peer protocols. Recommender systems such as Firefly brought computing to social networks. Content-aware music services are migrating to portable devices. Applications for data mining in networked communities of people are legion: discovering musical trends, tracking preferences and tastes, and analyzing listening behaviors.

Ubiquitous computing weaves digital space closely into real-world activities. To many, extrapolating their own computer experiences of extreme frustration, arcane technology, perceived personal inadequacy, and machine failure, this sounds like a nightmare. But proponents point out that it can’t be like that, because, if it is, it won’t work. Today’s visionaries foresee a world of “calm” computing in which hidden machines silently cooperate behind the scenes to make our lives richer and easier. They’ll reach beyond the big problems of corporate finance and school homework to the little annoyances such as where are the car keys, can I get a parking place, and is that shirt I saw last week at Macy’s still on the rack? Clocks will find the correct time after a power failure, the microwave will download new recipes from the Internet, kids’ toys will refresh themselves with new games and new vocabularies. Clothes labels will track washing, coffee cups will alert cleaning staff to mold, light switches will save energy if no-one is in the room, and pencils will digitize everything we draw. Where will data mining be in this new world? Everywhere!

It’s hard to point to examples of a future that does not yet exist. But advances in user interface technology are suggestive. Many repetitive tasks in direct-manipulation computer interfaces cannot be automated with standard application tools, forcing users to perform the same interface actions over and over again. This typifies the frustrations alluded to previously: who’s in charge—me or it? Experienced programmers might write a script to carry out such tasks on their behalf, but as operating systems accrue layer upon layer of complexity the power of programmers to command the machine is eroded and vanishes altogether when complex functionality is embedded in appliances rather than in general-purpose computers.

Research in programming by demonstration enables ordinary users to automate predictable tasks without requiring any programming knowledge at all. The user need only know how to perform the task in the usual way to be able to communicate it to the computer. One system, called Familiar, helps users automate iterative tasks involving existing applications on Macintosh computers. It works across applications and can work with completely new ones never before encountered. It does this by using Apple’s scripting language to glean information from each application and exploiting that information to make predictions. The agent tolerates noise. It generates explanations to inform the user about its predictions, and incorporates feedback. It’s adaptive: it learns specialized tasks for individual users. Furthermore, it is sensitive to each user’s style. If two people were teaching a task and happened to give identical demonstrations, Familiar would not necessarily infer identical programs—it’s tuned to their habits because it learns from their interaction history.

Familiar employs standard machine learning techniques to infer the user’s intent. Rules are used to evaluate predictions so that the best one can be presented to the user at each point. These rules are conditional so that users can teach classification tasks such as sorting files based on their type and assigning labels based on their size. They are learned incrementally: the agent adapts to individual users by recording their interaction history.

Many difficulties arise. One is scarcity of data. Users are loth to demonstrate several iterations of a task—they think the agent should immediately catch on to what they are doing. Whereas a data miner would consider a 100-instance dataset minuscule, users bridle at the prospect of demonstrating a task even half a dozen times. A second difficulty is the plethora of attributes. The computer desktop environment has hundreds of features that any given action might depend upon. This means that small datasets are overwhelmingly likely to contain attributes that are apparently highly predictive but nevertheless irrelevant, and specialized statistical tests are needed to compare alternative hypotheses. A third is that the iterative, improvement-driven development style that characterizes data mining applications fails. It is impossible in principle to create a fixed training and testing corpus for an interactive problem such as programming by demonstration because each improvement in the agent alters the test data by affecting how users react to it. A fourth is that existing application programs provide limited access to application and user data: often the raw material on which successful operation depends is inaccessible, buried deep within the application program.

Machine learning is already widely used at work. Text and Web mining is bringing the techniques in this book into our own lives, as we read our email and surf the Web. As for the future, it will be stranger than we can imagine. The spreading computing infrastructure will offer untold opportunities for learning. Machine learning will be in there, behind the scenes, playing a role that will turn out to be foundational.

13.10 Further Reading and Bibliographic Notes

Wu et al. (2008) describe the process of identifying the top 10 algorithms in data mining for presentation at the International Conference on Data Mining in 2006 in Hong Kong and have followed this up with a book that describes all the algorithms (Wu and Kumar, 2009). The paper on the “illusion of progress” in classification is by Hand (2006), and it was he who found that a very simple method achieves more than 90% of the classification improvement yielded by the most sophisticated scheme.

There is a substantial volume of literature that treats the topic of massive datasets, and we can only point to a few references here. Fayyad and Smyth (1995) describe the application of data mining to voluminous data from scientific experiments. Shafer, Agrawal, and Mehta (1996) describe a parallel version of a top-down decision tree inducer. A sequential decision tree algorithm for massive disk-resident datasets has been developed by Mehta, Agrawal, and Rissanen (1996). The technique of applying any algorithm to a large dataset by splitting it into smaller chunks and bagging or boosting the result is described by Breiman (1999); Frank, Holmes, Kirkby, and Hall (2002) explain the related pruning and selection scheme.

Early work on incremental induction of decision trees is reported by Utgoff (1989) and Utgoff, Berkman, and Clouse (1997). The Hoeffding tree was introduced by Domingos and Hulten (2000). Our description of it, including extensions and improvements, closely follows Kirkby’s (2007) PhD thesis. The MOA system is described by Bifet, Holmes, Kirkby, and Pfahringer (2010).

Despite its importance, comparatively little seems to have been written about the general problem of incorporating metadata into practical data mining. A scheme for encoding domain knowledge into propositional rules and its use for both deduction and induction has been investigated by Giraud-Carrier (1998). The related area of inductive logic programming, which deals with knowledge represented by first-order logic rules, is covered by Bergadano and Gunetti (1996). Probabilistic logic learning is covered by de Raedt (2008).

Text mining is a broad area, and there are few comprehensive surveys of the area as a whole: Witten (2004) provides one. A large number of feature selection and machine learning techniques have been applied to text categorization (Sebastiani, 2002). Martin (1995) describes applications of document clustering to information retrieval. Cavnar and Trenkle (1994) show how to use n-gram profiles to ascertain with high accuracy the language in which a document is written. The use of support vector machines for authorship ascription is described by Diederich, Kindermann, Leopold, and Paass (2003); the same technology was used by Dumais, Platt, Heckerman, and Sahami (1998) to assign key phrases from a controlled vocabulary to documents on the basis of a large number of training documents. The use of machine learning to extract key phrases from the document text has been investigated by Turney (1999), Frank, Paynter, Witten, Gutwin, and Nevill-Manning (1999) and Medelyan and Witten (2008).

Appelt (1999) describes many problems of information extraction. Many authors have applied machine learning to seek rules that extract slot-fillers for templates, e.g., Soderland, Fisher, Aseltine, and Lehnert (1995), Huffman (1996), and Freitag (2002). Califf and Mooney (1999) and Nahm and Mooney (2000) investigated the problem of extracting information from job ads posted on Internet newsgroups. An approach to finding information in running text based on compression techniques has been reported by Witten, Bray, Mahoui, and Teahan (1999a). Mann (1993) notes the plethora of variant spellings of the name Muammar Qaddafi on documents received by the Library of Congress.

Chakrabarti (2003) has written an excellent and comprehensive book on techniques of Web mining. Kushmerick, Weld, and Doorenbos (1997) developed techniques of wrapper induction. The founders of Google wrote an early paper that introduced the PageRank algorithm (Brin and Page, 1998). At the same time Kleinberg (1998) described a system called HITS (for hypertext-induced topic selection) that has some superficial similarities but produces strikingly different results.

The first paper on junk email filtering was written by Sahami, Dumais, Heckerman, and Horvitz (1998). Our material on computer network security is culled from Yurcik et al. (2003). The information on the CAPPS system comes from the U.S. House of Representatives Subcommittee on Aviation (2002), and the use of unsupervised learning for threat detection is described by Bay and Schwabacher (2003). Issues with privacy-preserving data mining techniques are discussed by Datta, Kargupta, and Sivakumar (2003). Stone and Veloso (2000) surveyed multiagent systems of the kind that are used for playing robo-soccer from a machine learning perspective. The fascinating story of Ben Ish Chai and the technique used to unmask him is from Koppel and Schler (2004).

A popular early framework for neural probabilistic language models was proposed by Bengio, Ducharme, Vincent, and Janvin (2003): a key element of the approach is to project words into a continuous vector representation. Extending the general idea, Collobert and Weston (2008) present a large unified neural network architecture for performing many natural language processing tasks using the same underlying network and shared word representations, and show that its performance is competitive across many common tasks. The influential word2vec technique was proposed by Mikolov, Chen, Corrado, and Dean (2013a, 2013b), while Morin and Bengio (2005) explored the hierarchical softmax technique for language modeling.

Russakovsky et al. (2015) review the ImageNet challenge and the rise of convolutional neural networks to prominence in object category recognition. The Google Street View house numbers are described by Netzer et al. (2011), who estimate human performance at 98% accuracy—and now many experiments with convolutional neural networks described in the literature exceed this figure. David Lowe invented SIFT descriptors. Lowe (2004) gives details of their implementation and applications: they work well, and have been patented.

Facebook’s “DeepFace” Siamese architecture for face verification uses a facial keypoint detector followed by a sophisticated 3D warping procedure to frontalize faces (Taigman, Yang, Ranzato, & Wolf, 2014). Of course, Facebook has access to a huge labeled database—4.4 million labeled faces from 4000 people (about 1000 each, on average). Siamese neural architectures were proposed by Bromley, Guyon, LeCun, Säckinger, and Shah (1994). The first system to produce face verification that rivaled human performance was the convolutional neural network of Sun, Chen, Wang, and Tang (2014), which used a 200,000-image database of faces of 10,000 celebrities. This system uses fewer facial landmarks, and crops patches out of images centered on the key points; ensembles of many models produced the best performance.

Amongst many other applications, Havaei et al. (2016) used deep convolutional neural networks for brain tumor segmentation; Zbontar and LeCun (2015) present impressive results for stereo vision.

The vision of calm computing, as well as the examples we have mentioned, is from Weiser (1996) and Weiser and Brown (1997). More information on different methods of programming by demonstration can be found in compendia by Cypher (1993) and Lieberman (2001). Mitchell, Caruana, Freitag, McDermott, and Zabowski (1994) report some experience with learning apprentices. Familiar is described by Paynter (2000). Permutation tests (Good, 1994) are statistical tests that are suitable for small sample problems: Frank (2000) describes their application in machine learning.

13.11 WEKA Implementations

• HoeffdingTree (creates decision trees and modifies them incrementally).
