3.6. Determining When a New Observation Is Not in a Training Set

Often in drug discovery, a computational model is built for a non-statistician, who relies on easily understood diagnostics and graphical representations to assess the accuracy of its predictions.

Once a model is in use, a common question asked of the statistician is "How accurate is the prediction?" If the model provides the user with a quantitative estimate, then the prediction error can be used to answer this question. This measure takes into account the distance between the new observation and the training set as well as the error in the model. If the model provides a categorical response and a linear discriminant model is used, then a bar chart of the probability that the observation belongs to each group can be provided to the user.

Another measure that can be provided to the user is the leverage value. Recall that the leverage value measures the distance between the new observation and the centroid of the training data set. If the leverage value of a new observation is greater than the maximum leverage value observed in the training set, then the prediction is an extrapolation and should be treated as unreliable.
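As a reference for the rule above, the leverage calculation can be written as follows. The notation is an assumption introduced here, not taken from the original text: X is the training-set descriptor matrix with rows x_i, and x_new is the descriptor vector of the new observation. This is the form that Program 3.7 later in this section appears to compute.

```latex
h_{\text{new}} = x_{\text{new}}^{\top} \left( X^{\top} X \right)^{-1} x_{\text{new}},
\qquad
h_{\max} = \max_{i} \; x_{i}^{\top} \left( X^{\top} X \right)^{-1} x_{i},
\qquad
\text{flag as extrapolation if } h_{\text{new}} > h_{\max}.
```

If the columns of X are centered (an assumption; the text does not say how the descriptors were preprocessed), h_new equals the squared Mahalanobis distance from x_new to the training centroid divided by n - 1, which matches the "distance to the centroid" description.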

Also, it is important for the modeler to keep track of the data sets the model is being applied to. If the data are far from the data the model was trained on, then the modeler should consider retraining the model so that the predictions remain reliable.

Once the statistician has provided the model, along with easily understood diagnostics, to the team, the statistician should also monitor how the model is being used. If the chemistry space the model is being used on is different from the chemistry space the model was trained on, then the statistician should consider retraining the model to provide the team with a more useful model. This monitoring can easily be done by keeping track of the leverage value from each prediction and by using a flag to indicate whether the prediction was an extrapolation.
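The monitoring idea can be sketched in SAS/IML as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the module name LeverageFlag and the output data set LEVMON are hypothetical, and the data set and variable names (SOLTRUSE, SOLTSUSE, X1 through X11) are borrowed from Program 3.7 in this section. The sketch uses GINV rather than the eigenvalue-based inverse in Program 3.7.

```sas
proc iml;
    /* Hypothetical module: returns one row per new observation with its */
    /* leverage value and a 0/1 flag (1 = outside training set space).   */
    start LeverageFlag(xtrain, xnew);
        invxtx = ginv(xtrain` * xtrain);
        /* Largest leverage value observed in the training set */
        himax = max(vecdiag(xtrain * invxtx * xtrain`));
        /* Leverage of each new observation relative to the training set */
        hi = vecdiag(xnew * invxtx * xnew`);
        flag = choose(hi > himax, 1, 0);
        return (hi || flag);
    finish;

    use soltruse;
    read all var ('x1':'x11') into xtrain;
    close soltruse;

    use soltsuse;
    read all var ('x1':'x11') into xnew;
    close soltsuse;

    /* Store the leverage values and flags so they can be tracked over time */
    monitor = LeverageFlag(xtrain, xnew);
    create levmon from monitor[colname={"leverage" "extrapolated"}];
    append from monitor;
    close levmon;
quit;
```

Writing the flags to a data set, rather than only printing them, makes it possible to track what fraction of predictions are extrapolations and to judge when retraining is worth considering.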

As an example, let's determine if the observations in the solubility test set are included in the solubility training space. Program 3.7 contains the code to fit the selected model to the test set and store the residuals in a SAS data set (RESDS data set). This is accomplished using PROC REG.

Program 3.7 also contains SAS/IML code to calculate the maximum leverage value in the training set. The program reads in the test set, finds the leverage values for each observation and determines which, if any, observations are outside the training set space. If some observations in the test set are outside the space, the program calculates the prediction error excluding these observations.

Program 3.7. Determine if the test set is in the training set space
proc reg data=soltsuse;
    model y=x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11;
    output out=resds r=yresids;
    run;
    quit;
proc iml;
    use soltruse;
    read all var ('x1':'x11') into x;
    * Calculate the maximum leverage value of the training set;
    xtx=x`*x;
    call eigen(evals,evecs,xtx);
    invxtx=evecs*diag(1/evals)*evecs`;
    himax=max(vecdiag(x*invxtx*x`));
    print himax[format=5.3];

    * Read in the test set and calculate the test set leverage values;
    * Determine which ones are in training set space;
    use soltsuse;
    read all var ('x1':'x11') into x;
    hi=vecdiag(x*invxtx*x`);
    indi=choose(hi>himax,1,0);
    outindi=loc(indi);
    print outindi;
    use resds;
    read all var{yresids} into e;
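    * Residuals for the observations inside the training set space;
    ein=e[loc(^indi),1];
    * Note: ERR below uses all residuals (E). Substitute EIN to exclude the flagged observations;
    err=sqrt(sum(e##2)/(nrow(e)-ncol(x)));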
    print err[format=5.3];
    quit;

Output 3.7. Output from Program 3.7
Analysis of Variance

                                          Sum of           Mean
      Source                   DF        Squares         Square    F Value    Pr > F

      Model                    11       89.03923        8.09448       9.16    <.0001
      Error                    37       32.69240        0.88358
      Corrected Total          48      121.73163

                                          HIMAX

                                          0.269

                                         OUTINDI

                                      6        40        41

                                           ERR

                                          0.928

Output 3.7 contains the analysis of variance table. The prediction error of the model calculated from the test set, which is the square root of the Mean Square Error (0.88358), is approximately 0.940. The maximum leverage value in the training set is approximately 0.269, so any observation in the test set whose leverage value is greater than this is outside the training set space. Three observations in the test set are outside the training set space (observations 6, 40 and 41). Program 3.7 reports a prediction error (ERR) of 0.928. Note that the ERR statement as listed sums the squares of all residuals in E rather than those in EIN, so EIN must be used in that statement for the three flagged observations to be excluded.
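The two error figures can be checked by hand from Output 3.7. In the arithmetic below, n = 49 is inferred from the Corrected Total degrees of freedom (48), and p = 11 is the number of descriptor columns read into X; the symbols n, p and e_i are notation introduced here.

```latex
\sqrt{\mathrm{MSE}} = \sqrt{0.88358} \approx 0.940,
\qquad
\mathrm{ERR} = \sqrt{\frac{\sum_{i=1}^{n} e_i^{2}}{n - p}}
             = \sqrt{\frac{32.69240}{49 - 11}} \approx 0.928 .
```

The second value reproduces the printed ERR using the full Error sum of squares, so the gap between 0.940 and 0.928 comes from the degrees of freedom (37 versus 38) and not from dropping observations.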
