3.1. Introduction

Computational models are used at many stages of drug discovery to predict structure-activity and structure-property relationships. These models rest on the principle that compounds with similar structures are expected to have similar biological activities (Golbraikh and Tropsha, 2002). In the early stages of drug discovery, high-throughput screens are used to identify compounds that are active against a specific target. Computational models used at this stage need to predict a compound's activity, but they do not need to be interpretable. Once a series of compounds has been identified as active against the biological target, the goal is to optimize the compounds in terms of properties such as solubility and permeability. At this stage, it is useful to have a computational model that scientists can interpret, so that they understand how a property changes when a particular feature of the structure is changed. For example, suppose solubility has an inverse relationship with a certain feature of the structure, such as the number of rotatable bonds; if the goal is to increase solubility, the model tells the team that decreasing the number of rotatable bonds should increase solubility. In this chapter we focus on methods the statistician can use to optimize the model-building process. As George Box (Box et al., 1978) once said, "All models are wrong, but some are useful."

To build a model, a statistician performs the following steps: divide the data into training and test sets, select the variables, and select the best statistical tool for prediction. There is a vast body of literature on statistical procedures for model building, and we discuss a few of them here. Our focus is on statistical techniques that can optimize the model-building process independently of the statistical procedure used to build the model.

A computational model is evaluated on its ability to predict future observations. One measure of this performance is the prediction error of the model calculated on an independent test set. It is therefore important that the test set be representative of the training set; otherwise, the prediction error obtained from the test set will not be indicative of the model's performance.
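The idea of a test-set prediction error can be made concrete with a minimal sketch. It is written in Python rather than the SAS used in this book, and everything in it is an assumption made for illustration: the data are synthetic, the model is a one-descriptor least-squares line, the split is a simple random 75/25 split, and the names `fit_line` and `rmse` are hypothetical helpers rather than part of any library.

```python
import math
import random


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x; returns (a, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


def rmse(xs, ys, a, b):
    """Root mean squared prediction error of the fitted line on (xs, ys)."""
    squared = [(y - (a + b * x)) ** 2 for x, y in zip(xs, ys)]
    return math.sqrt(sum(squared) / len(squared))


random.seed(0)

# Synthetic data (an assumption): the response falls as the descriptor rises.
xs = [random.uniform(0, 10) for _ in range(60)]
ys = [5.0 - 0.4 * x + random.gauss(0, 0.3) for x in xs]

# Random split: 45 observations for training, 15 held out for testing.
indices = list(range(len(xs)))
random.shuffle(indices)
train_idx, test_idx = indices[:45], indices[45:]

# Fit on the training set only, then score on the held-out test set.
a, b = fit_line([xs[i] for i in train_idx], [ys[i] for i in train_idx])
test_rmse = rmse([xs[i] for i in test_idx], [ys[i] for i in test_idx], a, b)

print(f"slope = {b:.3f}, test RMSE = {test_rmse:.3f}")
```

Because the test observations here are drawn from the same distribution as the training observations, the test RMSE is a reasonable estimate of future performance. If the test set covered a different range of the descriptor, the same number could be misleading.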

Molecular descriptors describe geometric, topological, and electronic properties of a compound. The descriptors used to build structure-activity and structure-property relationship models are often computer generated, and many programs are available to generate them. As a result, the modeler is often faced with hundreds, if not thousands, of descriptors. Because the descriptors are all derived from attributes of the same structure, many of them are highly correlated. To build a model that is interpretable, the goal of variable selection is to choose descriptors that are highly predictive of the response and independent of one another. A computational model built with independent descriptors will be more interpretable than a model built with correlated descriptors.
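One simple way to act on this goal is a greedy correlation filter, sketched below in Python rather than SAS. This is a generic illustration, not the variable selection technique introduced later in the chapter. The descriptor values, the response, the 0.9 cutoff, and the function names `pearson` and `select_descriptors` are all assumptions chosen for the example.

```python
import math


def pearson(u, v):
    """Pearson correlation of two equal-length sequences.

    Assumes neither sequence is constant (no zero-variance check).
    """
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    norm_u = math.sqrt(sum((a - mean_u) ** 2 for a in u))
    norm_v = math.sqrt(sum((b - mean_v) ** 2 for b in v))
    return cov / (norm_u * norm_v)


def select_descriptors(descriptors, response, max_pairwise=0.9):
    """Greedy filter: visit descriptors from most to least correlated with
    the response, and keep one only if its absolute correlation with every
    descriptor already kept is below max_pairwise."""
    ranked = sorted(
        descriptors,
        key=lambda name: abs(pearson(descriptors[name], response)),
        reverse=True,
    )
    kept = []
    for name in ranked:
        values = descriptors[name]
        if all(abs(pearson(values, descriptors[k])) < max_pairwise for k in kept):
            kept.append(name)
    return kept


# Hypothetical descriptor table: d2 is almost exactly twice d1.
descriptors = {
    "d1": [1, 2, 3, 4, 5, 6, 7, 8],
    "d2": [2.1, 3.9, 6.2, 8.0, 9.9, 12.1, 14.0, 16.2],
    "d3": [5, 1, 4, 2, 8, 3, 7, 6],
}
response = [1.2, 1.9, 3.2, 3.8, 5.3, 5.9, 7.1, 8.0]

kept = select_descriptors(descriptors, response)
print(kept)
```

With these values the filter keeps only one of the two nearly collinear descriptors `d1` and `d2`, together with `d3`. A filter of this kind looks only at pairwise correlation, so it can miss redundancy that involves three or more descriptors at once.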

When using a model to predict a response for a new set of data, it is important to know how different the new data are from the data used to train the model. If the new data are "far" from the data used to train and test the model, the prediction error is likely to be large, and one should consider retraining the model.
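What "far" means has to be defined before it can be checked. The Python sketch below shows one possible definition and is not necessarily the procedure described in Section 3.6: each descriptor is standardized using the training set, and a new observation is flagged when its scaled Euclidean distance from the training centroid exceeds the largest distance seen in the training set. The data, the cutoff rule, and the names `column_stats`, `scaled_distance`, and `in_domain` are assumptions made for illustration.

```python
import math


def column_stats(rows):
    """Per-descriptor mean and sample standard deviation of the training rows."""
    n = len(rows)
    p = len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [
        math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
        for j in range(p)
    ]
    return means, sds


def scaled_distance(row, means, sds):
    """Euclidean distance from the training centroid after standardizing."""
    return math.sqrt(sum(((x - m) / s) ** 2 for x, m, s in zip(row, means, sds)))


# Hypothetical training set with two descriptors per compound.
train = [
    [1.0, 10.0],
    [2.0, 12.0],
    [3.0, 11.0],
    [4.0, 13.0],
    [5.0, 14.0],
]
means, sds = column_stats(train)

# Assumed cutoff rule: the largest distance observed within the training set.
cutoff = max(scaled_distance(r, means, sds) for r in train)


def in_domain(row):
    """True if the row is no farther from the centroid than any training row."""
    return scaled_distance(row, means, sds) <= cutoff


print(in_domain([3.0, 12.0]))   # a point at the training centroid
print(in_domain([20.0, 40.0]))  # a point well outside the training range
```

This check ignores correlation between descriptors and treats the training data as a single cluster. A Mahalanobis distance or a nearest-neighbor distance would be a natural refinement when those assumptions do not hold.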

In Section 3.2 we provide a motivating example using a real drug discovery data set. In Section 3.3 we review methods for training and test set selection and discuss a novel technique that we have found useful. In Section 3.4 we provide an overview of variable selection techniques and introduce a new technique. In Section 3.5 we briefly review statistical procedures for model building. In Section 3.6 we discuss a simple procedure that can be used to determine whether a new observation lies in the descriptor space on which the model was trained. Finally, in Section 3.7 we use SAS Enterprise Miner to build a computational model.

To save space, some SAS code has been shortened and some output is not shown. The complete SAS code and data sets used in this book are available on the book's companion web site at http://support.sas.com/publishing/bbu/companion_site/60622.html
