STEP 8:
ASSESS BASIC ASPECTS OF VALIDITY—DOES THE INSTRUMENT MEASURE WHAT IT CLAIMS TO MEASURE?

The next step is to look for evidence of validity. In the instrument-development process, this is the stage of reality testing. It is here that the developer begins to see whether the model holds up for real managers in real jobs.

In other words, validity is integrity. If you do not know for sure whether an instrument measures what it says it measures and you do not have evidence that scores are related to effectiveness on the job, you will not know how to interpret data from that instrument. If you have not seen evidence that managers can improve their performance, you cannot know whether the instrument’s scales represent areas in which managers are able to develop themselves.

These considerations are especially important because managers will typically use the feedback to set performance goals or to try to change their behavior, and most will take what others say very seriously. Established validity is also helpful in dealing with managers who receive very negative feedback from others. To maintain their self-respect, these managers may attack the integrity of the instrument by arguing that it assesses areas that don’t make a difference or areas over which they have little control.

Studying the validity of 360-degree-feedback instruments involves three separate, but related, areas. The first is determining whether the instrument actually measures what it is intended to measure. For instance, does it measure leadership rather than intelligence or power? The second is seeing whether what is measured really makes a difference. Are higher scores actually related to greater effectiveness in real organizations? The third is evaluating whether what is measured can change as a result of training, feedback, or other developmental experiences. Can the instrument’s scales be shown to represent domains that are amenable to development?

Look in the technical materials for a section on validity. There may be several sections labeled “Construct Validity,” “Concurrent Validity,” and the like. One does not have to be thoroughly versed in the different aspects of validity to judge the extent of the validity evidence for an instrument. (The interested reader can find a discussion of the meaning of the different aspects of validity in the “Additional Considerations” section below.)

The modern view of validity, in fact, is that the different “types” of validity are actually aspects of a single concept—construct validity. Construct validity indicates what meaning can be attached to the scores from an instrument. Rather than examining the several, often overlapping types of validity, one should focus on the main inferences that underlie the use of an instrument. And by use we mean both the use for which it was intended and the use you plan to make of it. The typical assumptions underlying the use of instruments for management development are, as indicated above, that the instrument is measuring what it says it measures, that scores are related to effectiveness, and that the constructs measured are amenable to change.

Look in the validity section of the technical materials for studies that compare scores on the instrument to scores on another validated instrument (that is, an instrument whose psychometric properties are known). Favorable results in this type of study will provide some evidence that the instrument measures what it was intended to measure. Another kind of study to look for would be one that tests hypotheses about observable behaviors. For example, the instrument developer may hypothesize that certain scales on the instrument will be related to certain personality characteristics. Other types of studies that can test this assumption are described below in the “Construct Validity” section.

The next underlying assumption to be tested is that scores on the instrument are related to important aspects of managerial effectiveness on the job. The typical way to test this hypothesis is to compare scores on the instrument to some other data relating to effectiveness or competence. These data, if collected at the same time as the ratings on the instrument, reflect current effectiveness. Raters may be asked to rate the target manager on a different effectiveness measure; current performance-appraisal data may be collected; or data may be obtained that relate to current employee or workgroup productivity. In addition, effectiveness data or data on promotions may be collected after some time has elapsed, so that we can know whether performance can be predicted by scores on the instrument from an earlier time.

Look for any information about studies that relate scores on the instrument to another measure of effectiveness on the job. This other measure could be another instrument that has been completed on the manager, or it could be a rating on effectiveness done by the manager’s boss or co-workers. In a correlational study, look for correlations that are at least moderate in magnitude (.3 or higher). Or look for a study that has compared instrument scores for groups of managers considered to be performing effectively or ineffectively—high performers or low performers. In this case, look for a result that shows that scores on the instrument were able to predict membership in the high and low groups (that is, scores of managers in the high group were greater than scores of managers in the low group).
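To make these two kinds of studies concrete, the sketch below shows what the core computations might look like. It is written in Python with simulated data; the sample size, variable names, and scores are hypothetical and stand in for an instrument’s actual scale scores and an independent effectiveness criterion.

    import numpy as np
    from scipy import stats

    # Hypothetical data: one instrument scale score and an independent
    # effectiveness rating for the same 60 managers.
    rng = np.random.default_rng(0)
    scale_scores = rng.normal(3.5, 0.5, size=60)
    effectiveness = 0.4 * scale_scores + rng.normal(0, 0.4, size=60)

    # Correlational study: look for r of roughly .3 or higher.
    r, p = stats.pearsonr(scale_scores, effectiveness)
    print(f"criterion-related correlation: r = {r:.2f} (p = {p:.3f})")

    # Known-groups study: do managers rated as high performers score
    # higher on the instrument than those rated as low performers?
    median = np.median(effectiveness)
    high = scale_scores[effectiveness >= median]
    low = scale_scores[effectiveness < median]
    t, p = stats.ttest_ind(high, low, equal_var=False)
    print(f"high vs. low performers: t = {t:.2f} (p = {p:.3f})")

In a published validity study the criterion would, of course, be a real independent measure of effectiveness rather than simulated data; the sketch simply shows how the .3 benchmark and the high/low comparison are computed.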

The third type of study to look for is one that shows that the concepts measured are amenable to change as a result of training, feedback, or other developmental experiences. This is the most difficult study to find, because very few instrument developers have done this research. Studies of change resulting from feedback are very scarce.

One additional issue in examining validity is the sample used in any of the studies. How well do the characteristics of the sample reflect those of your target population? If the sample was primarily white males and your target audience will be more diverse, check to see whether gender or ethnic differences were found in the research on the instrument in question.

If you plan to use an instrument internationally, issues to be concerned with include the quality of translation, the availability of international norms, and evidence of reliability and validity for international samples.

Translating an instrument should be a carefully executed, multistage process, including both translation from English into the other language and independent translation from the other language back into English. The use of back-translation ensures that meaning has not been changed along the way. Careful attention should be paid to idioms. For example, “putting out fires” is common jargon among North American managers but has a different, more literal interpretation both in the nonmanagement arena and in areas outside the United States.

It is not always the case that an instrument translated for international use includes translated feedback and development materials. Many instrument vendors have translated only the instrument itself, not the feedback report or development guide. Products such as these can be used for English-speaking managers whose raters are not fluent in English. They cannot, however, be used for managers who are themselves not fluent in English, because those managers will be unable to read their feedback. Thus, if your target managers do not speak English, check thoroughly that all materials related to the instrument have been translated.

Finally, instruments developed from research, theory, or experience with North American managers must undergo further testing with the population for whom the instrument will be used. At a minimum, norms should be available for each language into which the instrument is translated. To compare North American managers’ scores to scores of managers from other cultures, look for statistical evidence supporting score equivalence. This would include factor analyses, scale alphas, and mean comparisons for each intended population. Vendors who have substantial databases should also have evidence supporting the equivalence of the questions themselves, such as analyses of differential item functioning (DIF) based on item response theory (see, for example, Ellis, 1989) or on logistic regression (LR) (see, for example, Swaminathan & Rogers, 1990). Also, cross-cultural validity studies should be undertaken to verify that scores on the instrument are related to effectiveness in the other cultures.
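For readers curious about what a DIF check involves, the following is a minimal sketch of the nested-model logistic-regression procedure in the spirit of Swaminathan and Rogers (1990). It is written in Python with simulated responses; the group labels, sample size, and data are hypothetical and serve only to illustrate the comparison of models.

    import numpy as np
    import statsmodels.api as sm
    from scipy import stats

    # Hypothetical responses to a single item (1 = endorsed, 0 = not),
    # a total scale score for each respondent, and group membership
    # (0 = reference culture, 1 = focal culture). A real analysis would
    # repeat this for every item on the instrument.
    rng = np.random.default_rng(1)
    n = 400
    group = rng.integers(0, 2, size=n)
    total = rng.normal(0, 1, size=n)
    item = rng.binomial(1, 1 / (1 + np.exp(-0.8 * total)))  # no true DIF simulated

    # Nested logistic models: total score only; plus group (uniform DIF);
    # plus the total-by-group interaction (nonuniform DIF).
    X1 = sm.add_constant(total)
    X2 = sm.add_constant(np.column_stack([total, group]))
    X3 = sm.add_constant(np.column_stack([total, group, total * group]))
    m1 = sm.Logit(item, X1).fit(disp=0)
    m2 = sm.Logit(item, X2).fit(disp=0)
    m3 = sm.Logit(item, X3).fit(disp=0)

    # Likelihood-ratio tests: a significant gain when the group term (or
    # the interaction) is added flags the item as functioning differently
    # across the two groups.
    print("uniform DIF p =", stats.chi2.sf(2 * (m2.llf - m1.llf), df=1))
    print("nonuniform DIF p =", stats.chi2.sf(2 * (m3.llf - m2.llf), df=1))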

The pace at which instrument publishers are translating instruments for international users has increased, and it is not uncommon to find instruments that have been translated into more than one language. It is, however, difficult to find North American-developed instruments that have been validated for the populations for which they have been translated.

So a cautious approach is warranted when selecting North American-developed instruments for non-English-speaking managers. Look for one that has been carefully translated and that has been studied for cross-cultural differences in norms, for the consistency of its factor structure and scale reliabilities across groups, and for whether its scores are related to effectiveness in similar ways across cultures.

Keep in mind, however, that no instrument is ever proven valid once and for all. Validity is an ongoing process that involves collecting evidence over time in varying situations for different populations. In fact, any time the intended use of an instrument departs significantly from the assumptions underlying validity research to date, more research needs to be done.

It is useful to think of validity as a continuum ranging from no evidence that the inferences underlying the instrument are valid, to some evidence, to a good deal of evidence among varying groups and in different contexts. In general, the number of validity studies done should correspond roughly to how long the instrument has been around. Newer instruments will have fewer studies reported, but all instruments that offer feedback from co-workers should have at least one study completed that indicates the instrument is measuring what it claims to measure, and one that shows that its scores are related to effectiveness.

Additional Considerations: Ways of Defining Validity

According to psychometricians and statisticians, there are several aspects of validity: construct, concurrent, predictive, and content.

Construct Validity. A construct is an abstract variable that cannot be directly measured. Temperature and humidity are not constructs, but leadership, resourcefulness, and integrity are.

Construct validity refers to the ability of the instrument to measure phenomena that are hypothesized to exist but for which we can have no direct measure. All the different “types” of validity (content, concurrent, predictive) are actually aspects of construct validity. Every validity study represents a test of hypotheses about relationships between a set of constructs. Some constructs are narrow, in the sense that they may be completely described by a few behaviors. Other constructs are so large or broad that they may never be perfectly defined by even a very large number of behaviors. Broad constructs, like leadership, are usually measured using only a subset of all possible behaviors.

Thinking about construct validity begins with thinking about the theory, if any, that has been used to specify the meaning of the construct; how the construct should be related to other constructs; and how it should be related to specific observable behaviors. One way construct validation research can proceed is by testing hypotheses generated about observable behaviors. Another strategy may be to compare scores on the target instrument to scores on a similar instrument whose psychometric properties are known.

Because constructs cannot be directly measured, another way to approach construct validity is through a multimethod study. For example, if we are measuring more than one construct (resourcefulness and integrity) and can measure them in multiple ways (paper-and-pencil and rater observations), then we can discern how much of the correlation between these two constructs is due to the measurement method and how much is due to overlap in the constructs themselves. To obtain evidence of construct validity in our example, the correlation between the paper-and-pencil and rater observation measures of resourcefulness should be greater than the correlation between resourcefulness and integrity using the same measurement technique. Convergent validity is the label given to the condition of a high correlation between the same construct measured by two different methods.
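A stripped-down numerical illustration of this multitrait-multimethod logic appears below, in Python with simulated data; the constructs and measures are the hypothetical ones used in the example above. It builds the correlation matrix and compares the same-trait, different-method correlation with the different-trait, same-method correlation.

    import numpy as np

    # Hypothetical data: two constructs (resourcefulness, integrity), each
    # measured two ways (paper-and-pencil survey and rater observation).
    rng = np.random.default_rng(2)
    n = 150
    resourcefulness = rng.normal(0, 1, size=n)
    integrity = rng.normal(0, 1, size=n)
    res_survey = resourcefulness + rng.normal(0, 0.5, size=n)
    res_rater = resourcefulness + rng.normal(0, 0.7, size=n)
    int_survey = integrity + rng.normal(0, 0.5, size=n)
    int_rater = integrity + rng.normal(0, 0.7, size=n)

    # Rows are variables; np.corrcoef returns the full correlation matrix.
    corr = np.corrcoef(np.vstack([res_survey, res_rater, int_survey, int_rater]))

    convergent = corr[0, 1]   # same construct, different methods
    same_method = corr[0, 2]  # different constructs, same method
    print(f"same trait, different methods (convergent): {convergent:.2f}")
    print(f"different traits, same method:              {same_method:.2f}")
    # Evidence of construct validity requires the first to exceed the second.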

Concurrent and Predictive Validity. These two aspects of validity, known collectively as criterion-related validity, have to do with the relationship of the scores to some other attribute, called a criterion variable. With 360-degree-feedback instruments, we are usually concerned with criteria that measure managerial performance or effectiveness.

Both concurrent and predictive validity are temporal in nature. Concurrent validity has to do with the relationship between scores on an instrument and performance or effectiveness measured at approximately the same time the instrument was completed. The relevant question is: How accurate is the instrument’s assessment of current performance or effectiveness? Predictive validity has to do with the relationship between scores on the instrument and performance or effectiveness sometime in the future. (Though some people use the term predictive to cover any study that looks at the relationship with effectiveness, we will limit our use of this term to mean only those studies that relate current scores to future performance.) The relevant question here is: How accurately can the instrument predict future performance or effectiveness?

Both concurrent and predictive studies are relevant for instruments used for management development, because in helping managers become more effective we are interested in both current effectiveness and effectiveness over the long term.

Criterion issue. In studies of both these types, the nature and measurement of the criterion variable is a key issue. Examining the criterion measure employed to validate an instrument can be as important as examining the study’s findings.

Remember that all leadership and management instruments are intended to assess factors associated with success or effectiveness. But success can be thought of as upward movement in one’s career (that is, promotion in the organization), or it can be thought of in terms of effectiveness as a leader or manager (that is, how satisfied others are with the individual’s performance, how well the manager’s direct reports perform, measures of profit-and-loss, and the like). These two phenomena are not necessarily the same (Luthans, 1988). Understanding how success has been defined in the research on an instrument and knowing how that compares with your own intended use of it is key to interpreting the results of any validity study.

According to Dalton (1996), the main, and only really appropriate, use of data from a 360-degree-feedback instrument is for management development or assessment for development. Most instruments of this type have not been developed or extensively tested for other uses (for example, selection or promotion). These instruments are basically dealing with people’s perceptions of behavior or competence, rather than any completely objective measure of performance (if one exists). An individual or organization planning to use an instrument of this sort for selection, promotion, pay, or performance purposes is in treacherous territory and will need to become aware of the relevant professional standards (SIOP, 1987) and government guidelines (Mohrman, Resnick-West, & Lawler, 1990), which are not covered in this report.

For management development, the main inferences to be drawn are that (1) high scores are related to effective current performance and (2) the competencies represent areas managers are able to develop.

Finding an appropriate criterion for a study of the first inference would involve choosing a definition of effective performance: If it is defined as the boss’s perception of the individual’s performance, the criterion variable might be performance-appraisal data; if it is defined by direct reports’ perceptions, the criterion might be “effectiveness” as rated by direct reports; and if it is defined as success in the organization, the criterion might be promotions. Variables less directly related to managerial effectiveness, such as bottom-line business profitability, may not be the best criterion measures. Businesses can be profitable for reasons other than good management, and businesses can fail despite good management.

A useful study of the second inference would show that managers are able to improve their scores as a result of training, feedback, or other developmental efforts on their part. As mentioned earlier, this type of study is more difficult to find.
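Were such a study available, its core analysis would be straightforward. The sketch below, in Python with simulated scores, compares scale scores before and after a developmental intervention using a paired test and a simple effect size; the sample and the amount of improvement shown are hypothetical.

    import numpy as np
    from scipy import stats

    # Hypothetical pre/post scale scores for 40 managers who received
    # feedback and then worked on the targeted competency for a year.
    rng = np.random.default_rng(3)
    pre = rng.normal(3.2, 0.5, size=40)
    post = pre + rng.normal(0.25, 0.3, size=40)  # simulated improvement

    t, p = stats.ttest_rel(post, pre)
    diff = post - pre
    d = diff.mean() / diff.std(ddof=1)  # standardized effect size
    print(f"paired t = {t:.2f}, p = {p:.3f}, d = {d:.2f}")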

Choosing the criterion variable for a validity study is one of the more problematic but critical phases of the research. The choice should be made with reference to the main inference(s) to be drawn from the instrument scores. In general, criterion data from an independent source (that is, a source different from the person or persons providing ratings on the instrument) are considered better because using an independent source reduces the bias that occurs when both pieces of data come from the same person.

Reliability of criterion measure. A good criterion measure should have a high degree of reliability. In other words, whatever it measures should be constant over time. When effectiveness ratings are used as the criterion measures, they are subject to some of the same reliability concerns as the instrument itself. For example, raters’ judgments about the overall effectiveness of their leaders may be unstable over short periods of time. Although this issue is an important one to think about in evaluating the quality of a validity study, it may be difficult to find such research because few instrument developers have taken the time to study the reliability of their criterion measures.
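When a developer has collected the necessary data, checking the stability of a criterion measure reduces to a test-retest correlation, as in this small sketch in Python; the ratings, sample, and time interval are hypothetical.

    import numpy as np
    from scipy import stats

    # Hypothetical effectiveness ratings of the same 50 managers collected
    # twice, a few weeks apart, from the same raters.
    rng = np.random.default_rng(4)
    time1 = rng.normal(3.5, 0.6, size=50)
    time2 = time1 + rng.normal(0, 0.3, size=50)  # mostly stable ratings

    r, _ = stats.pearsonr(time1, time2)
    print(f"test-retest reliability of the criterion: r = {r:.2f}")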

Content Validity. Content validity has to do with the degree to which the items are a representative and comprehensive measure of the phenomenon in question, in this case leadership or management skills. For example, a 360-degree-feedback instrument should cover areas that are important for effective performance for the job in question. Often this information can be derived from the theory or model on which the instrument was developed, or from the experience of the instrument developer. Content validity is often more a matter of professional judgment than statistical analysis.

There Is No Perfect Instrument: Balance Is the Key

Remember throughout this process that there is no perfect instrument—no one instrument that is perfect for everyone, no one instrument that will always be ideal for you, and no one instrument that can’t be improved. As solid and powerful as some are, each feedback tool has its weaknesses as well as its strengths.

The key to good instrument selection is balance: knowing what you need and seeking a balance between your needs and quality standards. You may have already made some conscious trade-offs, and before this selection process ends you may make some more, sacrificing one good to get another. But to maximize the value of your choice and to use the chosen instrument most effectively, you need to be aware of the trade-offs you have made: what you have lost and what you have gained.

Keep in mind that no instrument is proven valid once and for all. Validation is an ongoing process. Several good validity studies defining effectiveness in varying ways are stronger than a single study or several studies that define effectiveness in the same way.

Now that you have reviewed the origin of items and feedback scales and have examined the evidence of reliability and validity for the instruments, you have available one or more instruments that you believe have been constructed according to sound psychometric principles. At this point you can begin to evaluate and compare instruments in terms of their potential impact—what kind of feedback will be provided and how that will be delivered.
