4.2. Balanced and Unbalanced Data

When the design is balanced, in the sense that each group has the same number of measurements or certain orthogonality conditions are met (Searle, 1971), the analysis is considerably simpler in both computation and interpretation. In this case, for a given response variable, the (univariate) ANOVA partitioning (Searle, 1971) of the corrected total sum of squares into the sources of variation specified by the model is unique. This simplicity is unfortunately lost as soon as the underlying design becomes unbalanced. The partitioning of the corrected total sum of squares is then no longer unique: it depends on the model and its submodels, as determined by the order in which the various sums of squares are extracted. For example, suppose we have an unbalanced (univariate) two-way classification design with interaction, for which the statistical model is

    y_ijk = μ + α_i + β_j + γ_ij + ε_ijk


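Denoting the main effects as A and B and the interaction effect as AB, and following the notation of Searle (1971), the corrected model sum of squares R(A, B, AB|μ) can be partitioned in the following two alternative ways:

    R(A, B, AB|μ) = R(A|μ) + R(B|μ, A) + R(AB|μ, A, B)    (4.3)

or

    R(A, B, AB|μ) = R(B|μ) + R(A|μ, B) + R(AB|μ, A, B)    (4.4)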

where R(A|μ) is the sum of squares due to A after correcting for μ, and R(B|μ, A) is the sum of squares due to B after correcting for μ and the variable A (i.e., after discounting the effect of A). Other quantities are similarly defined. Unless the design is balanced, R(B|μ, A) ≠ R(B|μ) and R(A|μ, B) ≠ R(A|μ). The complexity increases further for higher order unbalanced designs. As a result, SAS computes four types of sums of squares, commonly referred to as Type I through Type IV sums of squares. A brief summary of these four sums of squares, adapted from Littell, Freund, and Spector (1991), follows.

  • The Type I sums of squares represent a partitioning of the model sum of squares into component sums of squares due to each variable or interaction as it is added sequentially to the model in the order prescribed by the MODEL statement (Littell, Freund, and Spector, 1991, p. 20). They are often referred to as sequential sums of squares. Because they depend on the order prescribed, the corresponding partitioning of the model sum of squares is not unique. For example, for a three-way classification model with all possible interactions in variables A, B, and C, the MODEL statement

    model y = a b c a*b a*c b*c a*b*c/ss1 ss2 ss3 ss4;

    results in the Type I sum of squares (generated by the use of option SS1) for, say, A*C as the one which is adjusted for all the previous terms in the model: A, B, C, and A*B.

  • The Type II sums of squares for a particular variable represent the increase in the model sum of squares due to adding the particular variable or interaction to a model that already contains all the other variables and interactions in the MODEL statement which do not notationally contain the particular variable or interaction (Littell, Freund, and Spector, 1991, p. 21). For example, for the MODEL statement given above, the Type II sum of squares for A*C represents the increase in the model sum of squares from adding A*C when A, B, C, A*B, and B*C have already been included in the model. The three-factor interaction is not included because the notational symbol A*B*C contains the symbol A*C. Type II sums of squares do not depend on the order in which the variables and interactions are listed in the MODEL statement. In general, Type II sums of squares for the various variables and interactions do not add up to the model sum of squares. Type II sums of squares are commonly called partial sums of squares.

  • The Type III sums of squares are also a kind of partial sums of squares (Littell, Freund, and Spector, 1991, p. 156). They differ from Type II sums of squares in that a particular sum of squares represents the increase in the model sum of squares due to adding the particular variable or interaction to a model that contains all the other variables and interactions listed in the MODEL statement. For example, for the MODEL statement given above, the Type III sum of squares for A*C represents the increase in the model sum of squares from adding A*C when all the remaining terms on the right-hand side of the MODEL statement, A, B, C, A*B, B*C, and A*B*C, have already been included in the model. As with Type II, Type III sums of squares do not depend on the order in which the variables and interactions are listed in the MODEL statement. Further, Type III sums of squares for the various variables and interactions do not, in general, add up to the model sum of squares.

  • When there are empty cells, the Type IV sums of squares are recommended. Unfortunately, they can be discussed only in the general framework of estimable functions and their construction. For cross-classified unbalanced data with empty cells, they are not unique, in that they depend on the way the data have been arranged. When there are no empty cells, Type IV sums of squares are identical to Type III sums of squares (Littell, Freund, and Spector, 1991, p. 156).

For details, see the SAS/STAT User's Guide, Version 6, Fourth Edition, Littell, Freund and Spector (1991), and Milliken and Johnson (1991).
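
The order dependence of the Type I sums of squares can be seen by fitting the same two-way model twice with the main effects listed in opposite orders. The following is only a minimal sketch: the data set name UNBAL, the classification variables A and B, and the response Y are hypothetical names chosen for illustration, not taken from any example in this book.

```sas
/* Hypothetical data set UNBAL: class variables A and B, response Y,    */
/* with unequal numbers of observations per cell.                       */

/* A listed first: the Type I table reports                             */
/* R(A|mu), R(B|mu,A), and R(AB|mu,A,B), in that order.                 */
proc glm data=unbal;
   class a b;
   model y = a b a*b / ss1 ss2 ss3 ss4;   /* request all four types */
run;
quit;

/* B listed first: the Type I table now reports                         */
/* R(B|mu), R(A|mu,B), and R(AB|mu,A,B).                                */
proc glm data=unbal;
   class a b;
   model y = b a a*b / ss1;
run;
quit;
```

For unbalanced data, the Type I lines for A and B would generally differ between the two runs, whereas the Type II and Type III lines do not depend on the order of the terms.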

When the data are multivariate, we analogously define sums of squares and crossproducts (SS&CP) matrices rather than just sums of squares. A partitioning essentially similar to the ANOVA partitioning, called MANOVA partitioning in the literature (M for multivariate), can be carried out for the corrected total SS&CP matrix. As in the univariate case, this partitioning is not unique for unbalanced data. Interpretations similar to those given in the references above can be assigned to the various types of analyses to help in choosing the appropriate MANOVA partitioning and analysis.

Based on Milliken and Johnson (1991), we make the following recommendations for two-way classification models. For most higher order models, a straightforward modification of these recommendations will be applicable.

  • Type III SS&CP matrices are appropriate when the interest is in comparing the effects of the experimental variables. The corresponding null hypotheses are equivalent to the hypotheses tested in the balanced classifications. Specifically, for a multivariate version of the two-way classification model stated earlier, the hypotheses being tested are


  • For model-building purposes, such as response surface modeling, where we want to predict the responses, Type I and/or Type II SS&CP matrices are desirable. Since the terms are usually added sequentially in the process of model building, Type I analysis may be more appropriate.

  • In survey designs and observational studies, such as those in sociology, where the data are collected "passively" rather than "actively" generated under a designed experiment, the number of observations per cell will be approximately proportional to the actual relative frequencies of these cells in the reference population. As a result, weighted averages with the observed cell sizes as weights may be of interest in analyzing the data. In this case, it is advisable to carry out and carefully interpret the two possible sequential analyses using Type I SS&CP matrices, leading to the partitionings given by Equations 4.3 and 4.4. Of course, three-way or other higher order cross-classified designs would require several sequential analyses.
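
The recommendations above translate into the choice of hypothesis type on the MANOVA statement of PROC GLM. The sketch below assumes a hypothetical data set UNBAL2 with classification variables A and B and two responses Y1 and Y2; these names are illustrative only.

```sas
/* Hypothetical data set UNBAL2: class variables A and B,               */
/* responses Y1 and Y2, unequal cell sizes.                             */
proc glm data=unbal2;
   class a b;
   /* SS1 and SS3 make both hypothesis types available;                 */
   /* NOUNI suppresses the univariate output for each response.         */
   model y1 y2 = a b a*b / ss1 ss3 nouni;

   /* Type III SS&CP matrices: comparing effects of the factors.        */
   /* PRINTH and PRINTE print the hypothesis and error SS&CP matrices.  */
   manova h=a b a*b / htype=3 printh printe;

   /* Type I (sequential) SS&CP matrices: A, then B given A, then A*B.  */
   manova h=a b a*b / htype=1 printh;
run;
quit;
```

To obtain the other sequential analysis, the same step would be rerun with B listed before A in the MODEL statement.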

Remember that underlying any test statistic or significant effect shown in any computer output, there is a specific null hypothesis which is being tested. In the case of designed experiments, the cell sizes are determined by the experimenter or by circumstances beyond the control of the experimenter. The effects, significant or not, are characteristics of the reference population and in no way should be a function of design parameters such as the cell sizes. It makes no intuitive sense for a null hypothesis to involve these parameters of the particular design. It is, therefore, very important that the appropriate null hypothesis be identified a priori, before declaring an effect significant or nonsignificant. This is preferable to retroactively identifying the hypothesis that corresponds to a significant or nonsignificant p value associated with a particular test statistic. In fact, in the case of highly unbalanced designs, the SS&CP matrices for notationally identical effects (in the computer output) under Type I, II, or III analyses may correspond to very different null hypotheses. Not surprisingly, one often obtains mutually conflicting conclusions from these analyses. Of course, the best solution to this problem is to construct a design which is as balanced as possible.

The issues related to which of the three sums of squares is appropriate have been the subject of considerable discussion for the past several decades. See Goodnight (1976) and Searle (1987). These issues do not seem to have subsided or been adequately settled or clarified, as is evident from the recent contributions to this topic. See Dallal (1992), De Long (1994), Goldstein (1994), and Searle (1994). It is thus inevitable that there will be no consensus on the various modes of analysis considered in the specific examples in this book. Wherever possible, we attempt to intuitively justify the type of analysis chosen, while deliberately avoiding the complex notational and mathematical representations of the underlying hypotheses.

Type IV analysis is appropriate in the case of missing observations when, for certain cells, the cell frequency is zero. In this case, none of the Type I, II, or III analyses may be entirely satisfactory, and all may be difficult to interpret. The Type IV hypotheses are constructed to have balance in the cell mean coefficients in such cases. As a result, meaningful interpretations can be assigned to the underlying hypotheses being tested by these SS&CP matrices. For designs with missing observations, PROC GLM automatically generates certain Type IV hypotheses, which can be identified by examining the list of estimable functions generated by SAS under the given design. Unfortunately, the resulting hypotheses may themselves depend on the numbering of the variables. Consequently, the very same set of treatments in the same data set, if renumbered or reordered, may result in a different set of Type IV SS&CP matrices. See Milliken and Johnson (1991) for an especially readable discussion, in which the authors devote an entire chapter to these and other related issues, albeit in a univariate setting.
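
One way to see which Type IV hypotheses are actually generated is to ask PROC GLM to print the corresponding estimable functions. The sketch below assumes a hypothetical data set EMPTYCELL, with classification variables A and B, responses Y1 and Y2, and at least one A-by-B cell containing no observations; all of these names are illustrative.

```sas
/* Hypothetical data set EMPTYCELL: at least one A*B cell is empty.     */
proc glm data=emptycell;
   class a b;
   /* E3 and E4 print the Type III and Type IV estimable functions,     */
   /* so the hypotheses behind each effect can be inspected directly.   */
   model y1 y2 = a b a*b / ss3 ss4 e3 e4;

   /* Multivariate tests based on the Type IV SS&CP matrices.           */
   manova h=a b a*b / htype=4 printh;
run;
quit;
```

Rerunning the step after recoding the levels of A or B so that they sort in a different order, and comparing the printed Type IV estimable functions, would show whether the hypotheses have changed for the data at hand.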

The preceding discussion of unbalanced designs pertains only to cases where there are unequal numbers of observations per cell or where the balancedness conditions (Searle, 1971) on the cell sizes are not satisfied. Imbalance may also occur when, for some observations or experiments, data are available on some of the response variables but not on the others. This situation, although quite common in practice, cannot be handled in the standard multivariate analysis of variance setup. As a result, for any multivariate analysis procedure, observations with missing values for one or more response variables are automatically deleted by SAS before any analysis. This is not necessarily an ideal choice, because the missingness may not follow a random pattern and there may be an underlying selection mechanism that caused the observations to be missing. In that case, because of the selection bias in the data, ignoring the missing values may severely affect the analysis and may consequently result in misleading conclusions.
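
Before running a multivariate analysis, it may be worth checking how many observations would be lost in this way and whether the losses are concentrated in particular cells. The following sketch assumes a hypothetical data set RESP with classification variables A and B and responses Y1, Y2, and Y3; the names are illustrative only.

```sas
/* Hypothetical data set RESP: class variables A and B,                 */
/* responses Y1, Y2, Y3, some of which may be missing.                  */
data check;
   set resp;
   n_missing = nmiss(of y1-y3);     /* number of missing responses      */
   complete  = (n_missing = 0);     /* 1 if all responses are present   */
run;

/* Overall count of complete and incomplete observations.               */
proc freq data=check;
   tables complete;
run;

/* Complete and incomplete counts within each A*B cell, to see          */
/* whether missingness is concentrated in particular cells.             */
proc freq data=check;
   tables a*b*complete / list;
run;
```

A markedly uneven distribution of incomplete observations across cells would be one informal indication that the missingness is not random, although such a tabulation cannot by itself establish the mechanism.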
