2.2. Motivating Example

Generally speaking, permeability is the ability of a molecule to cross a membrane. In the body, key membranes exist in the intestine and brain. They are composed of layers of molecules and proteins organized to prevent harmful substances from crossing while allowing essential substances to pass through. The intestine, for instance, allows substances such as nutrients to pass from the gut into the bloodstream. Another membrane, the blood-brain barrier, prevents detrimental substances from crossing from the bloodstream into the central nervous system. A molecule may have the right characteristics to be effective against a specific disease or condition, yet lack the characteristics needed to pass from the gut into the bloodstream or from the bloodstream into the central nervous system. Therefore, if a potentially effective compound is not permeable, its effectiveness may be compromised.

Because a compound's permeability status is critically important to its success, pharmaceutical companies would like to identify poorly permeable compounds as early as possible in the discovery process. These compounds can then be eliminated from follow-up, or can be modified in an attempt to improve permeability while keeping their potential target effectiveness.

To measure a compound's ability to permeate a biological membrane, several in vitro assays such as PAMPA and Caco-2 have been developed (Kansy et al., 1998). In each of these assays, a layer of cells (Caco-2) or an artificial lipid membrane (PAMPA) is used to simulate a biological membrane. A compound is placed into solution and added to one side of the membrane. After a period of time, the compound concentration is measured on the other side of the membrane. Compounds with poor permeability will show a low concentration on the far side, while compounds with high permeability will show a high concentration.

These screens are often effective at identifying the permeability status of compounds, but the screening process is moderately labor- and material-intensive. At a typical pharmaceutical company, available resources limit screening to only a few hundred compounds per week. Processing more compounds would require more resources. Alternatively, we could attempt to build a model that predicts permeability status.

As mentioned above, a biological membrane is a complex layer of molecules and proteins. To pass through a membrane, a substance must have an appropriate chemical composition. Therefore, to build a predictive model of permeability status, we should include appropriate chemical measurements for each compound.

For the example in this chapter, we have collected data on 354 compounds (the PERMY data set can be found on the book's companion Web site). For each compound we have its measured permeability status (y = 0 is not permeable and y = 1 is permeable), and we have used in-house software to compute 71 molecular properties that are theoretically related to permeability (e.g., hydrogen bonding, polarity, and molecular weight). For proprietary reasons, these descriptors have been blinded and are labeled as X1, X2, ..., X71.

As is common with many drug discovery data sets, the descriptor matrix for the data is overdetermined (that is, at least one variable is linearly dependent on one or more of the other variables, so the matrix is rank deficient). To check this in SAS, we can use PROC PRINCOMP to generate the eigenvalues of the covariance matrix. Recall that a full-rank covariance matrix has no zero eigenvalues. Program 2.1 generates the eigenvalues of the covariance matrix.

Program 2.1. Permeability data set: Computation of the eigenvalues of the covariance matrix
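/* Compute all 71 eigenvalues of the covariance matrix of X1-X71 */
proc princomp data=permy cov n=71 outstat=outpca noprint;
    var x1-x71;
run;
/* Keep only the row of eigenvalues from the OUTSTAT= data set */
data outpca;
    set outpca(where=(_type_="EIGENVAL"));
    keep x1-x71;
run;
/* Transpose so that each eigenvalue becomes one observation (variable EIG1) */
proc transpose data=outpca out=outpca_t(drop=_name_) prefix=eig;
    var x1-x71;
run;
proc print data=outpca_t;
run;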

Output 2.1. Output from Program 2.1

Obs          eig1

  1     531907.98
  2     202322.50
  3      39308.42
  4      22823.33
  5       9187.10
  6       5087.36
  7       2660.49
  8       1229.59
  9        814.62
 10        495.17

 62    .000003063
 63    .000000857
 64    .000000209
 65    .000000149
 66             0
 67             0
 68             0
 69             0
 70             0
 71             0

Output 2.1 lists the first 10 and last 10 eigenvalues of the covariance matrix computed from the permeability data set. Notice that the covariance matrix for the permeability data is not full rank. In fact, six eigenvalues are zero, while more than 20 are less than 0.01. This implies that the covariance matrix is not invertible, and methods that rely on the inverse of the covariance matrix will have a difficult time with these data.
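
The counts quoted above can be checked directly rather than read off the listing. The following is a minimal sketch, assuming Program 2.1 has already been run so that the OUTPCA_T data set holds the 71 eigenvalues in the variable EIG1. The cutoff of 1E-8 for treating an eigenvalue as numerically zero is an arbitrary choice made for this illustration, not a value taken from the text.

```sas
/* Sketch: tally small eigenvalues in OUTPCA_T (created by Program 2.1).   */
/* The 1E-8 tolerance is an assumed cutoff for "numerically zero".         */
/* In PROC SQL a comparison evaluates to 1 (true) or 0 (false), so SUM()   */
/* of a comparison counts the rows that satisfy it.                        */
proc sql;
    select count(*)         as n_eigenvalues,
           sum(eig1 < 1e-8) as n_near_zero,
           sum(eig1 < 0.01) as n_below_0_01
    from outpca_t;
quit;
```

The number of eigenvalues that are not near zero gives an estimate of the rank of the covariance matrix; a different tolerance may change the count slightly.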

Another common way in which the descriptor matrix can be overdetermined occurs when we have more descriptors than observations. Methods that rely on a full-rank covariance matrix of the descriptors, such as linear discriminant analysis, will fail with this type of data. Instead, we must either remove the redundancy in the data or employ methods, such as boosting or partial least squares for linear discrimination, that can look for predictive relationships in the presence of overdetermined data.
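
As one illustration of the first option (removing the redundancy), the descriptors can be replaced by scores on a small number of leading principal components before a method that needs a full-rank covariance matrix is applied. The sketch below is not taken from the text and is only one of several possible approaches. It assumes that the response variable in PERMY is named Y, it uses the hypothetical output data set name PERMY_PC, and it retains 10 components purely for illustration.

```sas
/* Sketch only: replace X1-X71 with scores on the leading principal components. */
/* N=10 is an assumed, illustrative choice; PERMY_PC is a hypothetical name.    */
proc princomp data=permy cov n=10 out=permy_pc noprint;
    var x1-x71;
run;

/* Linear discriminant analysis on the component scores.                        */
/* PRIN1-PRIN10 are the default score names written by PROC PRINCOMP.           */
/* POOL=YES requests a pooled covariance matrix, which yields a linear rule.    */
/* The class variable name Y is an assumption about the PERMY data set.         */
proc discrim data=permy_pc method=normal pool=yes;
    class y;
    var prin1-prin10;
run;
```

Because the retained components are chosen without reference to the response, this reduction can discard directions that are predictive of permeability. Methods such as partial least squares and boosting avoid that limitation.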
