STEP 7:
FIND OUT HOW CONSISTENT SCORES TEND TO BE

Information for this step and the next will again be found in the technical material for the instrument. First, look for a section on “Reliability,” or sections on “Test-retest Reliability,” “Internal Consistency,” and “Interrater Agreement.”

Basically, reliability is consistency. There are three aspects to reliability:

•  homogeneity within scales,

•  agreement within rater groups, and

•  stability over time.

Without evidence of reliability, we do not know whether the items and scales of an instrument are good enough to hold up even under stable conditions with similar raters. In other words, we do not know whether items that are grouped together in a scale are measuring the same competency (homogeneity within scales); we do not know whether groups of raters who bear the same relationship to a manager tend to interpret the items similarly (agreement within rater groups); and we do not know whether the meaning of items is clear enough so that a single rater will rate a manager the same way over a relatively short period of time (stability over time).

Homogeneity within Scales: Basic Issues in Internal Consistency

Homogeneity within scales is called internal consistency. This type of reliability applies to the scales on which feedback is to be given, rather than to the individual items to which raters respond. Reliability is assessed using a statistic called the correlation coefficient, which indicates the degree of relationship between two measures. Correlation coefficients can range from −1 (perfect negative relationship) to +1 (perfect positive relationship). A correlation of zero means there is no relationship between the two measures of interest.
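To make the idea concrete, here is a minimal Python sketch (the numbers are invented for illustration and do not come from any instrument's technical materials) showing a strongly positive, a strongly negative, and an essentially unrelated pair of measures:

    import numpy as np

    x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
    positive  = np.array([1.1, 1.9, 3.2, 3.8, 5.1])   # moves with x
    negative  = np.array([5.0, 4.1, 2.9, 2.2, 0.8])   # moves opposite to x
    unrelated = np.array([2.0, 5.0, 1.0, 4.0, 3.0])   # no systematic relationship

    # np.corrcoef returns a correlation matrix; the off-diagonal entry is the
    # Pearson correlation between the two measures, ranging from -1 to +1.
    for label, y in [("positive", positive),
                     ("negative", negative),
                     ("unrelated", unrelated)]:
        r = np.corrcoef(x, y)[0, 1]
        print(f"{label}: r = {r:+.2f}")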

Internal consistency measures are based on the average correlation among items and the number of items in the scale. Fundamentally, internal consistency asks whether all the items that make up a single scale are, in fact, measuring the same thing, as their inclusion in a single scale suggests. Managers who exhibit one of the behaviors that define the scale should also exhibit the behaviors described by the other items on that scale. If internal consistency is low, either the scale contains too few items or the items have little in common.

Though several statistical procedures exist for testing internal consistency, Cronbach’s alpha is the most widely used. An alpha of .7 or higher is generally considered to be acceptable. Low reliabilities are often the result of items that are not clearly written.

It should be noted here that the interpretation of reliability coefficients (how high they should be) is a subjective process. Although we provide rules of thumb for deciding whether a particular type of coefficient is high or low, there are many issues involved in making such interpretations. For example, factors such as the number of items on a scale and what skills are being measured by a scale can affect its reliability. So it’s probably a good idea to compare the reliability coefficients reported for several similar instruments before deciding what level of reliability seems to be acceptable.
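For readers who want to see the arithmetic behind the statistic, the sketch below applies the standard formula for Cronbach's alpha, alpha = (k / (k - 1)) * (1 - sum of item variances / variance of scale totals), to a small, invented item-response matrix. Note how the number of items, k, enters the formula directly; this is one reason longer scales tend to show higher internal consistency.

    import numpy as np

    # Invented data: rows are raters, columns are the items of one scale.
    items = np.array([
        [4, 5, 4, 4],
        [3, 3, 4, 3],
        [5, 5, 5, 4],
        [2, 3, 2, 3],
        [4, 4, 5, 4],
    ], dtype=float)

    k = items.shape[1]                         # number of items in the scale
    item_vars = items.var(axis=0, ddof=1)      # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)  # variance of raters' scale totals

    alpha = (k / (k - 1)) * (1 - item_vars.sum() / total_var)
    print(f"Cronbach's alpha: {alpha:.2f}")    # .7 or higher is generally acceptable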

Agreement within Rater Groups: Basic Issues in Interrater Reliability

Agreement within rater groups is commonly called interrater reliability. This concept assumes that two or more direct reports, peers, or supervisors, with an adequate opportunity to observe a manager’s behavior, should tend to agree on the level of performance of that manager, given reliable items and scales.

Interrater reliability does not tend to be very high for 360-degree-feedback instruments for several reasons. First, the raters are typically untrained. Trained raters may produce higher reliability by using a common evaluative framework. The people who usually complete these instruments, however, often rely on nothing more than their own perceptions, and perceptions tend to vary. Second, a group of direct reports may be managed differently by their boss. Their within-group reliability may be relatively low because the target manager actually interacts differently with each of them. Third, some behaviors may elicit higher interrater reliability than others because they are easier to observe and rate. And fourth, interrater reliability among peers may be lower to the extent that managers interact differently with different kinds of organizational peers.

Again, this type of reliability coefficient does not tend to be very high for 360-degree-feedback instruments; it should not, however, be extremely low. Look for correlation coefficients within specific rater groups (that is, peers or direct reports) of at least .4.
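Technical manuals report such coefficients in different forms (intraclass correlations are common). As a rough illustration of the underlying idea only, the sketch below averages the pairwise correlations among the item profiles of one invented group of three direct reports:

    import numpy as np
    from itertools import combinations

    # Invented data: each row is one direct report's ratings of the same
    # manager; each column is one item on the instrument.
    direct_reports = np.array([
        [4, 3, 5, 4, 4, 3],
        [4, 4, 5, 3, 4, 3],
        [3, 3, 4, 4, 5, 2],
    ], dtype=float)

    pairs = combinations(range(len(direct_reports)), 2)
    pairwise_r = [np.corrcoef(direct_reports[i], direct_reports[j])[0, 1]
                  for i, j in pairs]

    print(f"mean within-group correlation: {np.mean(pairwise_r):.2f}")
    # Rule of thumb from the text: roughly .4 to .7 within a rater group.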

Additional Considerations

It is not uncommon to find instruments that have not been tested for interrater reliability. There is some disagreement about whether such testing is necessary or useful for multiple-perspective instruments, because the perspectives taken by various raters (peers, boss, direct reports) will differ according to the organizational relationship they have with the target manager. The argument is that these differences in organizational perspective will show up as artificially low interrater correlations.

Yet careful analysis can be, and has been, done on reliability within the different perspectives. When approached this way, reliability should be higher within rater groups than between them. For example, peers should agree with one another more than peers agree with direct reports. This approach recognizes both the desirability of reasonable reliability among raters of the same type and the reality that raters of different types will disagree.

In looking at reliability within rater groups, one needs to be wary of very high correlations, because they may be an indication of raters’ overall bias toward the individual (the so-called halo effect), instead of a clear reflection of the individual’s performance. Therefore, as a rule of thumb, within-group reliability should not be lower than .4 or much higher than .7.

Stability over Time: Basic Issues in Test-retest Reliability

Stability over short periods of time is called test-retest reliability. It is studied in relation to both items and scales. An item or scale should be stable enough to produce a similar response this week and next week, given that no significant event, such as training, has intervened. If the meaning of an item is unclear or debatable, it may produce varying responses because of this ambiguity. Low test-retest results indicate the item should be revised and retested or dropped from the scale.

Look in the technical materials for test-retest coefficients of at least .4, which is minimally adequate; coefficients of .7 to .8 are quite high. This coefficient tends to be low for items on 360-degree instruments because others' perceptions of behavior or competence may be unstable. Yet it is still important that the developer has weeded out those items with very low stability over time.
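As a simple illustration (again with invented numbers rather than data from any real instrument), a test-retest check amounts to correlating scores from the two administrations:

    import numpy as np

    # Invented scale scores from the same raters at two points in time,
    # several weeks apart, with no intervening feedback or training.
    time_1 = np.array([3.4, 4.2, 2.9, 3.8, 4.6, 3.1, 3.7])
    time_2 = np.array([3.6, 4.0, 3.1, 3.9, 4.4, 3.3, 3.5])

    r = np.corrcoef(time_1, time_2)[0, 1]
    print(f"test-retest correlation: {r:.2f}")
    # Rule of thumb from the text: at least .4, with .7 to .8 quite high.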

Additional Considerations

A few instrument developers argue that test-retest reliability is inappropriate for 360-degree-feedback instruments. These developers argue that the simple act of completing a questionnaire will affect responses on a retest, and that an instrument designed to stimulate change should not be assessed on this criterion. Although both of these assertions may be partly true, neither negates the need to examine test-retest correlations as a check on basic item and scale stability.

Responding to a set of test items may, in fact, alert respondents to dimensions of managerial effectiveness of which they had previously been unaware. As a result, respondents may rate a manager somewhat differently on the second rating opportunity, having now had time to observe the desired behavior. This may be another reason why test-retest correlations do not typically turn out higher than .6 or so for 360-degree-feedback instruments.

Adequate test-retest reliability is especially important if you expect to use an instrument before and after training or over time to measure program impact or individual change. If test-retest reliability has not been demonstrated, you cannot know whether changes in scores over time represent real changes in performance or are merely the result of instability in the instrument itself.

To make a meaningful statistical comparison, a test-retest reliability study should compare data from two administrations of the instrument completed three to six weeks apart, with no intervening feedback or training. With this interval, study participants are unlikely to remember their initial ratings when they complete the instrument the second time.

Although completing the form may influence the way raters observe managers or influence the way managers rate themselves, managers are not likely to change significantly merely by completing the instrument. Really significant behavioral change is quite difficult to achieve, usually requiring both powerful feedback and intense follow-up.
