13.1. Introduction

Adequate clinical trials require the use of valid and reliable measurements. In general, a measure is said to be valid if it accurately reflects the concept it is intended to measure. A measure is said to be reliable if repeated assessments of a stable subject or population using the measure produce similar results. Establishing the reliability and validity of a measure is not a matter of a single test, but rather a summary of its psychometric properties using multiple approaches.
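As a simple illustration of the test-retest notion of reliability, the following hypothetical SAS program correlates two administrations of the same scale in stable subjects. The data set RETEST and the variables SCORE1 and SCORE2 are invented for illustration only.

* Hypothetical test-retest data: the same scale administered twice
  to stable subjects;
data retest;
   input subject score1 score2;
   datalines;
1 22 24
2 15 14
3 30 28
4 18 20
5 25 26
6 12 13
;
run;

* Pearson correlation between the two administrations serves as a
  simple test-retest reliability estimate;
proc corr data=retest pearson;
   var score1 score2;
run;

A high correlation between the two administrations is consistent with good test-retest reliability.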

Section 2.2.2 of the ICH E9 guidelines emphasizes the importance of using primary measures that are both reliable and valid for the population enrolled in a clinical trial. Specifically, "there should be sufficient evidence that the primary variable can provide a valid and reliable measure of some clinically relevant and important treatment benefit in the population described by the inclusion/exclusion criteria." In addition, the guidelines state that when rating scales are used as primary variables, it is especially important to address such factors as validity, inter- and intra-rater reliability, and responsiveness.

Regulatory guidance is not the only reason to use measures with adequate reliability and validity. Both Kraemer (1991) and Leon et al. (1995) studied the relationship between reliability and the power to detect treatment differences. Kraemer assessed test-retest reliability, while Leon et al. focused on internal consistency. Both demonstrated that large increases in power are possible (with a fixed sample size) by using a scale with greater reliability rather than a measure with low to moderate reliability. Faries et al. (2000) showed that using a unidimensional rather than a multidimensional scale in depression clinical trials reduces the required number of subjects by approximately one-third. Because some disease states have overlapping symptoms, using a scale with discriminant validity is critical to making claims of efficacy for a particular disease state. In summary, use of a measure without established reliability may prove unnecessarily costly (resulting in a larger trial than necessary or a failed trial due to reduced power), and use of a measure without appropriate validity can severely affect the credibility of the results.
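As a rough numerical illustration of this relationship (a simplified sketch, not the calculations in Kraemer, 1991, or Leon et al., 1995), the following SAS program assumes a true standardized treatment effect of 0.5 and uses the common simplification that the observed effect is attenuated by the square root of the scale's reliability, so that reliabilities of 0.9 and 0.6 yield observed effects of about 0.474 and 0.387, respectively.

* Observed effect approximated as true effect (0.5) times the square
  root of reliability: 0.5*sqrt(0.9)=0.474 and 0.5*sqrt(0.6)=0.387;
proc power;
   twosamplemeans test=diff
      meandiff  = 0.474 0.387   /* attenuated effects for reliability 0.9 and 0.6 */
      stddev    = 1
      power     = 0.8
      npergroup = .;            /* solve for the per-group sample size */
run;

Under these assumptions, the less reliable scale requires roughly one and a half times as many subjects per group to achieve 80% power, illustrating the cost of low reliability.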

In drug development, one is often faced with the challenge of validating a measure or selecting a measure based in part on its psychometric properties. For instance, one may wish to use a scale that was validated as a patient-rated scale as an observer-rated scale. In addition, one may need to create a composite variable, use a subscale, or wish to develop a new rating scale with potential advantages over existing measures. In each of these scenarios, the statistician is faced with the challenge of quantitatively demonstrating the reliability and validity of the outcome measures. Even when appropriate validation of a measure has been established, one is often interested in documenting inter-rater reliability for the group of individuals who will be conducting a specific experiment or trial.

In this chapter we review some basic approaches for assessing the reliability and validity of an outcome measure and provide several numerical examples using SAS. The development of a rating scale typically starts with literature reviews, focus groups, and cognitive interviews to help develop and refine potential items. However, the development stage of an outcome measure is beyond the scope of this chapter. For an example of the development of a new scale, see Cappelleri et al. (2004).

To save space, some SAS code has been shortened and some output is not shown. The complete SAS code and data sets used in this book are available on the book's companion web site at http://support.sas.com/publishing/bbu/companion_site/60622.html

13.1.1. Other methods

The following reliability/validity methods are not discussed in detail in this chapter, but are provided here as a reference. For cases in which a threshold can be specified below which a difference is not clinically relevant, Lin et al. (2000) discuss the use of a total deviation index and coverage probabilities relative to the concordance correlation coefficient (CCC). Evans et al. (2004) provide an example of using item response theory to identify items on the Hamilton Rating Scale for Depression (HAMD) with poor psychometric properties. Beretvas and Pastor (2003) summarize the use of mixed-effects models and reliability generalization studies for assessing reliability. Lastly, Donner and Eliasziw (1987) discuss sample size guidelines for reliability studies.
