2.1. Introduction

Drug discovery teams are often faced with data for which the samples have been categorized into two or more groups. For example, early in the drug discovery process, high throughput screening is used to identify compounds' activity status against a specific biological target. At a subsequent stage of discovery, screens are used to measure compounds' solubility, permeability, and toxicity status. In other areas of drug discovery, information from animal models on disease classification, survival status, and occurrence of adverse events is obtained and scrutinized. Based on these categorizations, teams must decide which compounds to pursue for further development.

In addition to the categorical response, discovery data often contains variables that describe features of the samples. For example, many computational chemistry software packages have the ability to generate structural and physical-property descriptors for any defined set of compounds. In genomics and proteomics, expression profiles can be measured on tissue samples.

Given data that contains descriptors and a categorical response, teams desire to uncover relationships between the descriptors and response that can provide them with scientific intuition and help them predict the classification of future compounds or samples.

In drug discovery, data sets are often large and over-described (there are more descriptors than samples, or the descriptors are highly correlated). And often, the relationships between descriptors and classification grouping are complex; that is, the different classes of samples cannot be easily separated by a line or hyperplane. Hence, methods that rely on inverting the covariance matrix of the descriptors, or methods that find the best separating hyperplane, are not effective for this type of data. Traditional statistical methods such as linear discriminant analysis and logistic regression are both linear classification methods and rely on the covariance matrix of the descriptors. For the reasons mentioned previously, neither is optimal for modeling many discovery-type data sets.
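For concreteness, consider a minimal sketch of how these traditional methods are typically invoked in SAS; the data set and variable names here (COMPOUNDS, ACTIVITY, X1-X500) are hypothetical, chosen to depict a case with more descriptors than samples:

   /* COMPOUNDS holds n samples with binary response ACTIVITY and   */
   /* descriptors X1-X500, where 500 exceeds the sample size. The   */
   /* within-class covariance matrix is then singular, and linear   */
   /* discriminant analysis cannot invert it.                       */
   proc discrim data=compounds;
      class activity;
      var x1-x500;
   run;

   /* Logistic regression faces an analogous problem: with more    */
   /* parameters than observations, the maximum likelihood          */
   /* estimates are not identifiable and the fit does not converge. */
   proc logistic data=compounds descending;
      model activity = x1-x500;
   run;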

In this situation, one approach is to reduce the descriptor space in a way that stabilizes the covariance matrix. Principal component analysis, variable selection techniques such as genetic algorithms, or ridging approaches are popular ways to obtain a stable covariance matrix. When the matrix is stable, traditional discrimination techniques can be implemented. In each of these approaches, the dimension reduction is performed independently of the discrimination. An alternative approach, partial least squares for linear discrimination, has the added benefit of simultaneously reducing dimension while finding the optimal classification rule.
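As a rough illustration of the difference between the two approaches, the sketch below (again with hypothetical data set and variable names) first reduces the descriptors to principal components and then discriminates on the scores, and contrasts that two-step recipe with PROC PLS, which extracts its factors with the numerically coded response in view:

   /* Two-step approach: reduce X1-X500 to 10 principal components, */
   /* then run the discriminant analysis on the component scores.   */
   proc princomp data=compounds n=10 out=pcscores;
      var x1-x500;
   run;

   proc discrim data=pcscores;
      class activity;
      var prin1-prin10;
   run;

   /* One-step alternative: partial least squares extracts factors  */
   /* using the response itself (coded 0/1 in ACTIVITY01), reducing */
   /* dimension and building the linear rule simultaneously.        */
   proc pls data=compounds nfac=3;
      model activity01 = x1-x500;
   run;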

Another promising classification method is support vector machines (Vapnik, 1996). In short, support vector machines seek to find a partition through the data that maximizes the margin (the space between observations from separate classes) while minimizing classification error. For data in which the classes are clearly separable, it is possible to maximize the margin while minimizing classification error, regardless of the complexity of the boundary. However, for data whose classes overlap, maximizing the margin and minimizing classification error are competing constraints. In this case, maximizing the margin produces a smoother classification boundary, while minimizing classification error produces a more flexible boundary. Although support vector machines are an effective classification tool, they can easily overfit a complex data set and can be difficult to interpret.
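This trade-off can be made precise. In the standard soft-margin formulation, for training pairs $(x_i, y_i)$ with class labels $y_i \in \{-1, +1\}$, the classifier solves

$$
\min_{w,\,b,\,\xi}\ \frac{1}{2}\lVert w \rVert^2 + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i\,(w^{\mathsf{T}} x_i + b) \ge 1 - \xi_i, \qquad \xi_i \ge 0,
$$

where minimizing $\lVert w \rVert^2$ rewards a wide margin (a smoother boundary), the slack variables $\xi_i$ absorb margin violations (classification error), and the cost parameter $C$ sets the balance between the two competing goals.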

Recursive partitioning seeks to find individual variables from the original data that partition the samples into purer subsets of the original data (Breiman et al., 1984). Thus, a recursive partition model is a sequence of decision rules on the original variables. Because recursive partitioning selects only one variable at each partition, it can be used to model an overdetermined data set. This method can also find a more complex classification boundary because it partitions the data into smaller and smaller hypercubes. While a recursive partition model is easy to interpret, its greedy nature can prevent it from finding the optimal classification model for the data.
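To see why such a model is easy to interpret, note that a fitted partition reduces to a handful of if-then rules on the original descriptors. The sketch below scores new compounds with a hypothetical two-split tree; the variables MOLWT and CLOGP and their cutpoints are purely illustrative:

   /* A hypothetical two-split recursive partition expressed as   */
   /* if-then rules on the original descriptors.                   */
   data scored;
      set newcmpds;
      length predclass $8;
      if molwt < 450 then do;
         if clogp < 3.5 then predclass = 'Active';
         else predclass = 'Inactive';
      end;
      else predclass = 'Inactive';
   run;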

More recently, methods which combine classifiers, known as ensemble techniques, have been shown to outperform many individual classification techniques. Popular ensemble methods include bagging (Breiman, 1996), ARCing (Breiman, 1998), and boosting (Freund and Schapire, 1996a), which have been shown to be particularly effective classification tools when used in conjunction with recursive partitioning.

Because of their popularity, effectiveness, and ease of implementation in SAS, we will limit the focus of this chapter to boosting and partial least squares for discrimination.

In Section 2.2, we will motivate our methods with an example of a typical drug discovery data set. Section 2.3 will explore boosting and its implementation in SAS, as well as its effectiveness on the motivating example, and Section 2.4 will review model building techniques. Section 2.5 will discuss the use of partial least squares for discrimination and SAS implementation issues, and apply this method to the motivating example.

To save space, some SAS code has been shortened and some output is not shown. The complete SAS code and data sets used in this book are available on the book's companion Web site at http://support.sas.com/publishing/bbu/companion_site/60622.html.
