13.3. Validity and other topics

A measure is said to be valid if it accurately reflects the concept it is intended to measure. Assessing the validity of a measure typically includes quantifying convergent validity, divergent validity, and discriminant validity. These three topics, which comprise the assessment of the construct validity of an instrument, can be addressed in a quantitative fashion and are discussed below.

Content validity is another topic which is typically addressed in the development stage of a measurement tool. Content validity refers to the degree to which a measure covers the range of meanings that are part of the concept to be measured. The most common approaches to assessing content validity are expert reviews of the clarity, comprehensiveness, and redundancy of the measurement tool. Content validity will not be discussed further in this chapter.

However, responsiveness, factor analysis, and minimal clinically important differences are three additional topics that are discussed. Responsiveness and the factor structure of a measurement tool are two important practical concepts to assess when validating an instrument. Establishing a minimal clinically important difference addresses a common need for applied researchers designing studies with, and interpreting data from, a measurement tool.

13.3.1. Convergent and Divergent Validity

Convergent validity is established by showing a strong relationship between the scale under review and another validated scale thought to measure the same construct. Divergent validity is the lack of correlation between the scale under review and scales thought to assess different constructs. Pearson's correlation coefficient is the most commonly used statistic to quantify convergent and divergent validity. Thus, SAS can easily summarize convergent and divergent validity through the CORR procedure. A Pearson correlation of at least 0.4 has been utilized (Cappelleri et al., 2004) as evidence for convergent validity. Similarly, correlations less than 0.3 indicate evidence for divergent validity, with correlations between 0.3 and 0.4 taken as no evidence to establish or dismiss convergent or divergent validity. Confidence intervals are useful here to provide information beyond simple point estimates.
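As a sketch of how this might be done in SAS, the following code requests Pearson correlations along with 95% confidence limits based on the Fisher z transformation (the FISHER option of PROC CORR); the data set VALIDATION and all variable names are hypothetical.

* Convergent/divergent validity: Pearson correlations of the scale under
  review with comparator scales, with Fisher z confidence limits;
proc corr data=validation pearson fisher;
    var sear_conf;             /* scale under review (hypothetical name) */
    with sf36_mh sf36_pf;      /* comparator scales (hypothetical names) */
run;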

Cappelleri et al. (2004) provide an example of the assessment of convergent and divergent validity for a new scale measuring self-esteem and relationships (SEAR) in patients with erectile dysfunction. As part of the assessment of convergent/divergent validity, they included a global quality of life scale (SF-36) in a validation study. They hypothesized low correlations between the Confidence domain of the SEAR and the physical domains of the SF-36 and thus utilized these comparisons to assess divergent validity. Similarly, for convergent validity, the developers hypothesized higher correlations with the SF-36 mental health domains. Table 13.3 presents the results.

Table 13-3. Correlation of SF-36 components with SEAR Confidence domain

SF-36 component              Correlation
Physical Functioning         0.30
Role-Physical                0.30
Bodily Pain                  0.32
Mental Health                0.44
Role-Emotional               0.45
Mental Component Summary     0.45

The correlations greater than 0.4 with the mental health measures were taken as evidence for convergent validity, while the correlations with the physical domains were borderline (at least, providing no evidence to dismiss divergent validity).

13.3.2. Discriminant Validity

Discriminant validity indicates the ability of a scale to distinguish different groups of subjects. An instrument for assessing the severity of a disease should clearly be able to distinguish between subjects with and without the corresponding diagnosis. Often, however, many disease symptoms overlap with those of other diseases, as with patients diagnosed with depression versus anxiety disorder. Thus, in establishing the validity of a measure for a particular disease, the ability to distinguish the disease state from others with overlapping symptoms may be extremely important. Note that there is some inconsistency in the literature regarding the use of the term discriminant validity, with some references using it in the same sense as divergent validity above (for example, Guyatt, 1991). In this discussion, however, discriminant validity is defined as the ability to distinguish between relevant subject groups.

Thurber et al. (2002) assessed the discriminant validity of the Zung self-rating depression scale in 259 individuals referred to a state vocational rehabilitation service. The Zung scale was administered as part of a test battery, including the depression subscale of the Minnesota Multiphasic Personality Inventory-2 (MMPI-2), following a diagnostic interview. Using logistic regression and stepwise logistic regression (for example, via the LOGISTIC procedure), they demonstrated that the Zung scale was the strongest predictor for distinguishing depressed from non-depressed individuals, as well as for distinguishing individuals diagnosed with depression from those diagnosed with substance abuse. The Zung scale remained predictive of a diagnosis of depression even after the forced inclusion of the MMPI-2 depression subscale in the model.
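A minimal sketch of such an analysis is given below, assuming a data set BATTERY with a binary diagnosis indicator DEPRESSED and scores MMPI2_DEP and ZUNG (all names hypothetical). The INCLUDE=1 option forces the first effect listed in the MODEL statement (here the MMPI-2 subscale) into the model while the remaining effects undergo stepwise selection.

* Stepwise logistic regression with forced inclusion of the MMPI-2
  depression subscale (first effect listed, INCLUDE=1);
proc logistic data=battery;
    model depressed(event='1')=mmpi2_dep zung
          / selection=stepwise include=1;
run;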

Once the scale has been demonstrated to be predictive of a diagnosis, the ability of the scale to predict the diagnosis for an individual subject is often assessed using receiver operating characteristic (ROC) curves. The ROC curve is a graph of sensitivity versus (1-specificity) for various cutoff scores. For these data, based on an ROC curve, Thurber et al. chose a cutoff score of 60 on the Zung scale for identifying subjects with and without a diagnosis of depression. With this cutoff score, sensitivity (the proportion of depressed subjects correctly identified) was 0.57, while specificity (the proportion of subjects without depression who were correctly identified) was 0.83. See Deyo et al. (1991) for a further discussion of ROC curves and an example using changes in SIP scores (see Section 13.2.7) to predict improvement status.
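As a sketch, the OUTROC= option of PROC LOGISTIC saves the sensitivity (_SENSIT_) and 1 minus specificity (_1MSPEC_) at each candidate cutoff, which can then be plotted; data set and variable names are again hypothetical.

* ROC curve for the Zung scale as a predictor of a depression diagnosis;
proc logistic data=battery;
    model depressed(event='1')=zung / outroc=roc;
run;
* Plot sensitivity against 1-specificity across all cutoffs;
proc gplot data=roc;
    plot _sensit_*_1mspec_;
run;
quit;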

13.3.3. Responsiveness

Responsiveness is the ability of an instrument to detect small but clinically important changes. Responsiveness is often referred to as "sensitivity to change" and is often viewed as part of construct validity. Because the main purpose of a clinical trial is to detect a treatment effect, it is important to assess the responsiveness of a scale prior to using it in a clinical trial. A scale that is not responsive may be unable to detect important treatment changes and may therefore mislead the experimenter into concluding there is no treatment effect. The methods discussed above are predominantly point-in-time analyses (except for test-retest reliability, which focuses on the stability of scores) and do not fully demonstrate that an instrument would be effective in a clinical trial designed to detect differences in change scores. The standardized response mean (SRM; the mean change divided by the standard deviation of the change scores) is a common unitless statistic for summarizing responsiveness (Stratford et al., 1996). Change scores may be summarized following an intervention expected to produce a clinically relevant change. When a control is available, effect sizes for an effective intervention relative to the control may be compared across multiple measures.

For example, Faries et al. (2001) assessed the responsiveness of the ADHD rating scale when administered by trained clinicians. The scale had previously been validated as a parent-scored tool. As no control group was available, the SRM was utilized to compare the changes observed on the new version of the scale with those on other validated instruments. Results showed that the observed SRM for the clinician-scored scale (1.21) was in the range of SRMs observed for other clinician and parent measures (1.13 to 1.40). As the SRM is based on simple summary statistics, PROC UNIVARIATE provides the information necessary for its computation.
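A minimal sketch of the SRM computation follows; the data set ADHD and the variables BASELINE and ENDPOINT are hypothetical names.

* Compute change scores from baseline and endpoint (hypothetical names);
data change;
    set adhd;
    chg=endpoint-baseline;
run;
* Mean and standard deviation of the change scores;
proc univariate data=change noprint;
    var chg;
    output out=stats mean=meanchg std=sdchg;
run;
* SRM is the mean change divided by the SD of the change scores;
data srm;
    set stats;
    srm=meanchg/sdchg;
run;
proc print data=srm noobs;
run;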

Responsiveness of an instrument may also be assessed using correlations with a global improvement scale. The global improvement question usually asks the subjects to rate the improvement in their condition on an ordinal scale. An example of such a scale would be "very much better", "much better", "a little better", "no change", "a little worse", "much worse", and "very much worse". Mean changes in the instrument scores are calculated for subjects in each global scale response category, regardless of treatment. A monotone increasing (or decreasing, depending on the direction of improvement) pattern in these mean scores is a desirable property for a scale that is responsive to treatment. An analysis of variance (ANOVA) can be performed to test the differences in the mean instrument change scores among subjects in different global scale response categories, as sketched below. The model should include the change in the instrument scores as the dependent variable and the global scale response category as the class variable. A responsive scale should be able to discriminate reasonably well between the response categories of a global rating scale.
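The sketch below illustrates such an analysis with the GLM procedure, assuming a change score CHG and a global rating variable GLOBAL (both hypothetical names).

* ANOVA of instrument change scores across global rating categories;
proc glm data=change;
    class global;
    model chg=global;
    means global;   /* mean change per category, ideally monotone */
run;
quit;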

13.3.4. Identifying Subscales

It is often desirable to identify subscales of a multi-item questionnaire in addition to its overall total score. Factor analysis is a technique commonly used to explore the existence of such subscales. It describes the covariance relationships among many items in terms of a few underlying, but unobservable, variables called factors. Each factor may correspond to a subscale. If items can be grouped by their correlations, and all items within a particular group are highly correlated among themselves but have relatively small correlations with items in a different group, it is plausible that each group of items represents a single underlying construct, or subscale.

One approach is to come up with an a priori subscale structure proposed by experts in the field and test the appropriateness of that structure using confirmatory factor analysis. In this case, the experts identify the subscales first and allocate the items to each subscale based on their opinion. Then a confirmatory factor analysis model can be used to test the fit of the model. In a confirmatory factor analysis model, the researcher must have an idea about the number of factors and know which items load on which factors. The model parameters are defined accordingly prior to fitting the model. For example, if an item is hypothesized to load on a specific factor, the corresponding factor loading will be estimated, and the loadings of this item on the other factors will be set to zero. After estimating the parameters, the fit of the model is tested to assess its appropriateness. Fitting a confirmatory factor analysis model requires more detailed knowledge of factor analysis. Details about confirmatory factor analysis, and about using the CALIS procedure to fit such models, can be found in Hatcher (1994).
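For orientation only, a minimal sketch of such a model in the LINEQS framework of PROC CALIS is shown below; the allocation of five I-QOL items to two factors is purely illustrative and is not the published I-QOL structure.

* Confirmatory two-factor model where each item loads on a single factor
  (cross-loadings are fixed at zero simply by omitting them);
proc calis data=iqol cov;
    lineqs
        iqol1  = lam1  f1 + e1,
        iqol4  = lam4  f1 + e2,
        iqol10 = lam10 f1 + e3,
        iqol7  = lam7  f2 + e4,
        iqol17 = lam17 f2 + e5;
    std f1 f2 = 2 * 1.0,        /* factor variances fixed at 1 */
        e1-e5 = the1-the5;      /* free error variances        */
    cov f1 f2 = phi;            /* factor correlation          */
run;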

Another way is to use an exploratory factor analysis to identify the number of subscales and the items to be allocated to each subscale. As opposed to confirmatory factor analysis, the researcher is usually unsure about the number of factors and how the items will load on the factors. After identifying the subscales, the experts should confirm the face validity of the subscales. Face validity is not validity in the technical sense; it is concerned not with what the test actually measures, but with what it appears superficially to measure. For example, a psychiatrist might review the items loading on a "Sleep Problems" subscale of a depression symptom questionnaire to see whether those items appear to be related to measuring sleep disturbances in a depressed patient.

Exploratory factor analysis usually involves two stages. The first is to identify the number of factors and estimate the model parameters. There are several methods of estimating the model parameters; the most commonly used are the principal component and maximum likelihood methods. As initial factors are typically difficult to interpret, a second stage of rotation makes the final result more interpretable. A factor rotation is carried out to look for a pattern of loadings such that each item loads highly on a single factor (subscale) and has small to moderate loadings on the remaining factors. This is called simple structure. Orthogonal and oblique factor rotations are the two types of transformations that may be needed to achieve simple structure. Orthogonal rotations are appropriate for a factor model in which the common factors are assumed to be uncorrelated, while oblique rotations allow the common factors to be correlated. The type of transformation (orthogonal versus oblique) can be decided by a graphical examination of the factor loadings (whether a rigid or nonrigid transformation brings the factor loadings close to the axes). For detailed information about factor analysis, see Johnson and Wichern (1982) and Hatcher (1994).

EXAMPLE: Incontinence Quality of Life questionnaire

The Incontinence Quality of Life questionnaire (I-QOL) is a validated incontinence-specific quality of life questionnaire that includes 22 items (Wagner et al., 1996; Patrick et al., 1999). The I-QOL yields a total score and three subscale scores: Avoidance and Limiting Behaviors, Psychosocial Impacts, and Social Embarrassment. For simplicity, we selected 9 of the items from two of the subscales to demonstrate how an exploratory factor analysis can be used to define two of the three subscales. The data were obtained from a pre-randomization visit of a pharmacotherapy clinical trial for the treatment of women with urinary incontinence.

Program 13.6 uses PROC FACTOR to perform an exploratory factor analysis of the I-QOL data (the IQOL data set used in the program can be found on the book's web site). PROC FACTOR can be used to fit a linear factor model and estimate the factor loadings. The SCREE option plots the eigenvalues associated with each factor to help identify the number of factors. An eigenvalue represents the amount of variance that is accounted for by a given factor. The scree test looks for a break or separation between the factors with relatively large eigenvalues and those with smaller eigenvalues. The factors that appear before the break are assumed to be meaningful and are retained. Figure 13.1, produced by Program 13.6, indicates that two factors were sufficient to explain most of the variability. Although the number of factors can be specified in PROC FACTOR for the initial model, this initial choice is only a starting point and can be changed in subsequent analyses based on empirical results. Even though the NFACT option specifies a certain number of factors, the output will still include the scree plot and the eigenvalues, with the number of factors extracted being equal to the number of variables analyzed. However, the parts of the output related to factor loadings and rotations will include only the number of factors specified with the NFACT option.

The ROTATE and REORDER options are used to help interpret the obtained factors. In order to achieve simple structure, the VARIMAX rotation was carried out in this example. The REORDER option reorders the variables according to their largest factor loadings. The SIMPLE option displays means, standard deviations, and the number of observations. Upon joint examination of the scree plot, the eigenvalues, and the rotated factors, the number of factors in the NFACT option should be changed to the appropriate level and the model re-run. In some instances, it may be desirable to keep factors with eigenvalues less than 1 if interpretation of those factors makes sense. The maximum likelihood method (METHOD=ML) is useful since it provides a chi-square test for model fit. However, as this test is a function of the sample size, for large studies it may reject the hypothesis of a sufficient number of factors due to differences that are not clinically relevant. Therefore, it is recommended to utilize other goodness-of-fit indices as well (Hatcher, 1994).

Program 13-6. Subscale identification in the I-QOL example
* Exploratory factor analysis with maximum likelihood estimation,
  VARIMAX rotation, and two retained factors;
proc factor data=iqol simple method=ml scree heywood reorder
            rotate=varimax nfact=2;
    var iqol1 iqol4 iqol5 iqol6 iqol7 iqol10 iqol11 iqol17 iqol20;
    ods output Eigenvalues=scree;
run;
* Keep the eigenvalues of the nine analyzed items and number them;
data scree;
    set scree;
    if _n_<=9;
    number=_n_;
run;
* Vertical axis;
axis1 minor=none label=(angle=90 "Eigenvalue") order=(-1 to 11 by 2);
* Horizontal axis;
axis2 minor=none label=("Number") order=(1 to 9);
symbol1 i=none value=dot color=black height=3;
* Scree plot of the eigenvalues;
proc gplot data=scree;
    plot eigenvalue*number/vaxis=axis1 haxis=axis2 vref=0 lvref=34 frame;
    run;
    quit;

Figure 13-1. Scree plot in the I-QOL example

Output 13-6. Output from Program 13-6
Initial Factor Method: Maximum Likelihood

                            Factor Pattern

                                        Factor1         Factor2

       IQOL6       IQOL item 6          0.79447         0.10125
       IQOL11      IQOL item 11         0.77077        −0.16251
       IQOL17      IQOL item 17         0.76174         0.39284
       IQOL7       IQOL item 7          0.74458         0.38372
       IQOL4       IQOL item 4          0.69355        −0.46179
       IQOL5       IQOL item 5          0.69353         0.19886
       IQOL20      IQOL item 20         0.65887        −0.20804
       IQOL1       IQOL item 1          0.64754        −0.37380
       IQOL10      IQOL item 10         0.64487        −0.30917

                        Rotated Factor Pattern

                                        Factor1         Factor2

       IQOL4       IQOL item 4          0.80996         0.19553
       IQOL1       IQOL item 1          0.71412         0.22151
       IQOL10      IQOL item 10         0.66487         0.26344
       IQOL11      IQOL item 11         0.64271         0.45545
       IQOL20      IQOL item 20         0.60014         0.34238
       IQOL17      IQOL item 17         0.22891         0.82594
       IQOL7       IQOL item 7          0.22394         0.80715
       IQOL6       IQOL item 6          0.46518         0.65195
       IQOL5       IQOL item 5          0.32498         0.64414

Output 13.6 displays selected sections of the output produced by Program 13.6. The top portion of the output (under "Initial Factor Method: Maximum Likelihood") presents the factor loadings after fitting the model with two factors (NFACT=2). All items have high loadings on the first factor and small loadings on the second, suggesting a rotation (each factor retained should represent some of the items). The bottom portion of Output 13.6 (under "Rotated Factor Pattern") displays the factor loadings after the VARIMAX rotation. The transformed factor loading structure suggests that the first factor (subscale) should consist of Items 1, 4, 10, 11, and 20, since those items are heavily loaded on Factor 1. Similarly, the second subscale should consist of Items 5, 6, 7, and 17. At this stage, the researcher should decide whether the items that load on a given factor share some conceptual meaning and whether the items that load on different factors seem to be measuring different constructs. In this example, the first subscale was called Avoidance and Limiting Behaviors and the second Psychosocial Impacts in the original development of the I-QOL subscales.

13.3.5. Minimal Clinically Important Differences

Another important need is to determine the between- and within-treatment minimal clinically important differences (MCIDs) for an instrument. The MCID helps clinicians interpret the relevance of changes in the instrument scores. The within-treatment MCID is defined as the improvement in a score with treatment at which a patient recognizes that he or she has improved. The between-treatment MCID is the minimum difference between two treatments that can be considered clinically relevant. One widely accepted way to determine the MCIDs is to anchor the scale to a global rating of change scale such as the one mentioned in the responsiveness discussion. The mean change in the measure of interest for those subjects who rated themselves as "a little better" could be considered the within-treatment MCID. The difference in mean changes between subjects who rated themselves as "a little better" and those who rated "no change" could be considered the between-treatment MCID. The between-treatment MCID can be a sound choice for the treatment difference used to power clinical studies. These two MCIDs provide guidance to researchers in interpreting change scores for the instrument. They become critical when statistically significant differences need to be justified as clinically relevant. The choice of a global scale is an important step in determining the MCIDs. Global scales with items that are less sensitive to change may yield larger MCIDs. For example, if the responses to a global scale of improvement are "better", "no change", or "worse", the MCIDs calculated using this scale may be larger than those calculated from the global scale in the previous example. Therefore, the differences could still be clinically important but not necessarily minimal using this scale.
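A simple sketch of the anchor-based computation follows: mean changes by global rating category, from which the within- and between-treatment MCIDs can be read off. The data set and variable names are hypothetical.

* Mean change in the instrument score for each global rating category.
  The "a little better" mean serves as the within-treatment MCID, and
  its difference from the "no change" mean as the between-treatment MCID;
proc means data=change mean std n;
    class global;
    var chg;
run;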
