5
Design of a Measurement Scale

5.1. Introduction

In the previous chapter, we discussed the main steps in building a multi-item reflective measurement scale. However, these steps are not sufficient to capture all the decisions to be made in this process: they specify the general procedure to be followed, but say nothing about which item formulations may be better than others, or which response formats may be more efficient for data collection and analysis. The quality of an instrument is largely determined by the number of items, the labels (words) used, the formulation of the items, the number of response modalities, the numbering associated with them, the presence or absence of response labels and/or numbers, the type of scale (Likert, semantic differential, etc.), and so on.

Reflections on the scale to be built require more than knowledge of the stages through which the researcher must pass and of the statistics to be applied. Researchers often do not pay enough attention to the design of a scale, wrongly believing that the purification and validation processes will identify all the limitations and shortcomings to be corrected. While statistics (reliability tests and factor analyses) make it possible to assess the psychometric qualities of a scale, they do not replace the researcher's judgment of whether a measurement design comes close to the realities experienced by respondents while avoiding certain biases. This reflection must be carried out carefully from the beginning of the scale construction process, because once fieldwork has begun, it is difficult to go back. While revisions and adjustments remain possible, they are likely to cost time and additional effort.

Decisions relating to the design of a measurement scale do not concern only the development of new scales, but also the adaptation of an existing scale, which may touch on all of its aspects (response formats, labels, deletion and/or addition of items, etc.). In addition, many of the considerations discussed in this chapter (e.g. the formulation of indicators, the number of response modalities, etc.) do not relate exclusively to reflective measures, but also to formative specifications. In this chapter, we will review some of the components of a scale, particularly item formulation, numbering, response formats (number and labels) and the presentation of some of the attributes of a scale.

5.2. The quantitative attributes of a measurement scale

Often, the researcher is in doubt about the number of items to include in a scale. These items must then be assessed against a number of response modalities intended to cover respondents’ perceptions or opinions. For each of these two quantitative attributes of a scale (number of items and number of response choices), a set of criteria helps determine an optimal number.

5.2.1. Optimal number of items

The idea that a single-item scale is sufficient to capture the less abstract constructs is sometimes accepted. However, this point is debatable, and scales with multiple indicators remain the most frequent choice despite the disadvantages they generate (cumbersome administration, high cost, etc.). Without revisiting the discussion in Chapter 3 (section 3.3) on choosing between single-item and multi-item measures, efforts should be made to adjust the length of the scale so that its advantages are maximized. It should be remembered that arbitrating the number of indicators to retain is a concern specific to reflective scales. In the following, we will first outline a few proposals for the number of items to retain, and then review some criteria that facilitate this choice.

5.2.1.1. A few proposals

Having summarized the contributions and limitations of single-item versus long multi-item scales, Malhotra et al. (2012) proposed a sequence for finding the optimal number of items between these two extremes, without specifying one in particular. On this point, the literature is not very abundant. Akrout (2010, p. 112) points out that three items are recommended to properly represent a latent variable. Diamantopoulos et al. (2012) propose four items for making comparisons and testing a measurement model. Bagozzi (2011) also retains four items for a relevant and interesting estimate of a reflective measurement model, adding that this applies to one-factor (unidimensional) models.

According to Bagozzi (2011), for models with two or more factors, at least two indicators per dimension are needed to avoid ambiguity. But he assumes that it would be better to retain three or more indicators. Mowen and Voss (2008, p. 499) “recommended that researchers set a goal of developing scales that have between four and eight items; further, if a scale has dimensions, each dimension should have from three to five items”. Hinkin et al. (1997) recommend generating, at the very beginning of the scale development process, twice as many items as necessary, that is four to six items for each construct, suggesting that only half of them will be retained in the next steps.

5.2.1.2. Selection criteria

It is clear that none of the various proposals mentioned constitute a rule to be adopted in absolute terms. When choosing the number of items to be retained, the researcher must take into account several considerations:

  • the step of the scale construction process: it is logical to retain many more items in the genesis phase, then, following the purification and validation steps, some will be eliminated. Fabrigar et al. (1999) propose a number of at least four items for each dimension conceptually planned to allow for a robust exploratory factor analysis (EFA). Hair et al. (2009, p. 100) recommend at least five items per factor to conduct an EFA;
  • respondent characteristics: the more time respondents have available and the greater their cognitive abilities, the more the number of items can be increased. In some cases, a large number of items is desirable, particularly for better scale reliability, but it also requires a larger sample size for adequate statistical tests, which is difficult to obtain when respondents are unavailable. Similarly, some respondents may have difficulty cognitively processing a large number of items;
  • the statistical processing envisaged: while some data analysis methods are more sensitive to the number of items, others have requirements in terms of the optimal number to be used. Thus, the number of items can only be determined with regard to the statistics envisaged and the constraints associated with them;
  • the construct to be measured: the more abstract the construct, the more indicators are needed for a better understanding of its content. But it is not a question of multiplying indicators without respecting the validity of their content with regard to the definition of the construct;
  • the number of dimensions of the construct: the more the construct refers to a multitude of facets (dimensions), the higher the number of items on a scale in order to understand this diversity of dimensions;
  • the wording of the items: when a scale includes reversed items (positive and negative forms), the question of the number to be retained from each category is relevant to consider. It would seem more appropriate, whenever possible, to use a balanced scale with the same number of items for each polarity;
  • data collection methods: in some cases (example: administration via telephone), it is very difficult to question respondents at length on a large list of items. Other modes of administration, such as the face-to-face technique, are more flexible with the length of the questionnaire and therefore with the number of items on a scale.

The researcher’s judgment should preferably focus on all these determinants in order to optimize the number of items that will be definitively selected. Empirical data and psychometric properties sometimes force them to accept a non-optimal solution according to some of the methodological considerations mentioned above. It should be remembered, however, that it is possible to move back and forth between the different methodological decisions in order to harmonize the whole.

5.2.2. Number of response modalities

The debate here relates to the question of the number of points (response modalities) to be retained. Some support the choice of a large number, others a small number. The controversy is essentially engaged in the following dilemma: discrimination of responses versus response bias. In this section, the discussion will focus first on the results of research dealing with the consequences of different formats (number of points) and their determinants, then on proposals for the number of response modalities, and finally on some recommendations and criteria that may facilitate the choice of an optimal number of response modalities.

5.2.2.1. The number of points: different angles of analysis

We will review some research findings. Some support the merits of a large number of response modalities, others point out the disadvantages, while others present contrasting results:

  • advantages of a large number of response modalities. Faced with the need for nuanced information that can distinguish respondents’ attitudes, opinions and perceptions, some believe that a large number of modalities are likely to promote rich information and allow for the use of certain statistics. The increase in the number of points seems to improve the metric data, encouraging a wider use of analytical tools. The research found that:
    • - an increase in the number of modalities increases the standard deviation, which probably conveys more information (Peterson 1997), increases variance and therefore reliability (Churchill and Peter 1984; Cook et al. 2001), and increases the value of correlation coefficients (Martin 1973, 1978);
    • - an increase in reliability, generated by the characteristics of a scale (including response modalities), affects the estimation of some aspects of the validity of a scale (Peter and Churchill 1986).
  • disadvantages of a large number of response modalities. It seems that when the effort required of the respondent to locate their response on the scale is too great, there is a high probability of increasing the non-response rate or having biased information. This may be due to many causes such as respondent fatigue, inability to distinguish between modalities, etc. For example, some researchers do not support the practice of increasing the number of modalities for the following reasons:
    • - there is no relationship between reliability and the number of response modalities (Wilson 1995); for the determination of the number of points on a Likert scale, reliability and validity are not criteria to be taken into account (Jacoby and Matell 1971);
    • - improved data accuracy, facilitated by the increase in the number of points, may be limited by respondents’ cognitive abilities and their patience in considering several options for a longer response time (Cook et al. 2001).
  • contrasting results of the advantages and disadvantages associated with the number of modalities. Some research results partially confirm the previous findings, focusing on the advantages and disadvantages of the high number of response modalities, mainly at two levels:
    • - first, the accuracy of the information supported by a higher number of response choices does not necessarily increase respondent fatigue. Thus, the results of comparisons between 5- and 10-point satisfaction measures showed that the 10-point scale gives better convergent, discriminant and nomological validity, without higher response efforts required of respondents across the 10-point scale, and both formats (5, 10 points) produce similar non-response rates and average scores (Coelho and Esteves 2007). Similarly, Pearse (2011) observed that a 21-point Likert scale for measuring turnover does not generate fatigue on the part of respondents – this number seems rather interesting for respondents because it increases the variability of responses offered to them;
    • - second, a format with a small number of modalities may be simpler and clearer for respondents while not systematically providing lower data quality than a format with a wider range of choices. In this regard, Dawes (2008), comparing 5-, 7- and 10-point Likert versions of an eight-item measure of the price-consciousness construct, states that the results depend only partially on the number of points chosen. While he noted that the 5-point scale seems simpler for respondents, he also noted that none of the formats should be rejected in terms of the data obtained: all three yield comparable results (regressions, confirmatory factor analyses and structural equations).
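The purely mechanical part of these findings (more points yield a larger raw standard deviation, simply because the same attitudes are spread over a wider range) can be sketched with a short simulation. This is a minimal illustration under stated assumptions: the uniform distribution of latent attitudes and the function names are ours, not drawn from the studies cited.

```python
import numpy as np

rng = np.random.default_rng(42)

def discretize(latent, k):
    """Map latent scores in [0, 1) to a k-point response format (1..k)."""
    return np.floor(latent * k).astype(int) + 1

# Hypothetical latent attitudes; the uniform distribution is an assumption.
latent = rng.uniform(size=10_000)

sd_5 = discretize(latent, 5).std(ddof=1)
sd_10 = discretize(latent, 10).std(ddof=1)
# The 10-point version spreads the same attitudes over a wider range,
# so its standard deviation is larger in raw scale units.
```

This shows only the arithmetic side of the argument; whether the extra spread carries real information about respondents is precisely what the studies above dispute.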

This clearly suggests that increasing the number of response modalities may not, in some situations, generate a greater wealth of information and, at the same time, may not systematically create a high burden during data collection and processing. How is it then possible to understand the contrast in these different results? In fact, discussions about the optimal number of response choices are made from two main perspectives: the researcher and the respondent. The observation of the debates under way clearly underlines that there does not seem to be agreement on an optimal number of modalities. It appears from the results of several investigations that this number can only be decided under certain conditions:

  • respondent characteristics: socio-demographic characteristics, respondent involvement in the phenomenon studied (Coelho and Esteves 2007), cognitive abilities to deal with several possibilities (Cox 1980), preferences in terms of ease and speed of use (Preston and Colman 2000);
  • the characteristics of the scale such as the number of items (example: Givon and Shapira 1984), the type of scale (Cook et al. 2001);
  • the nature of the phenomenon to be measured (Coelho and Esteves 2007);
  • the psychometric qualities of the scale, including reliability and validity (Cox 1980; Preston and Colman 2000).

5.2.2.2. The optimal number of points: some proposals

Starting from different angles of analysis, some proposals surrounding the number of measurement echelons can be put forward. Some start from a few criteria, including those mentioned above, others discuss the issue of whether or not to integrate a neutral position:

  • number proposals: Preston and Colman (2000) point out that, overall, 7, 9 or 10 points seem satisfactory. Green and Rao (1970) recommend the use of at least 6 or 8 points. Lehmann (1972) suggests 5 to 7. Revilla et al. (2014) suggest that 5 points give better quality information than 7 or 11 points. With respect to respondents’ information processing, Chen et al. (2015) found that the cognitive effort required from respondents is lower with 5- and 7-point scales than with the other formats tested (4, 6, 8, 9 points). With reference to possible biases linked to the number of points, Moors (2008) found that extreme response styles can exist with both 5- and 6-point formats. According to McKelvie (1978), 5 to 6 points should be used; he adds that there are no psychometric advantages to using a high number of points (more than 9 to 12) and that discriminating power and validity suffer with fewer than 5 points. From the point of view of the psychometric qualities (reliability and validity) of a Likert scale, Lozano et al. (2008) recommend at least 4 points and specifically suggest between 4 and 7. Lee and Paek (2014) found that results are similar for scales of 4 to 6 points, with a clearer decrease in psychometric qualities for 3- and 2-point scales. Eutsler and Lang (2015) note that 7 points maximizes variance. Malhotra et al. (2009) found that a 7-point scale provides better criterion validity for bipolar constructs than a 3-, 5- or 9-point scale. Despite these differences on the optimal number of points, most agreement converges on the number 7 plus or minus 2.
Although this magic number (7 plus or minus 2) seems, according to Cox (1980), to be a reasonable choice, he notes that no number is appropriate for all situations and provides some recommendations to facilitate this decision: 2- or 3-point scales are unable to convey much information and tend to frustrate respondents, while the marginal gain obtained with more than 9 possible answers is minimal;
  • proposals for an even or odd number of choices: several researchers (e.g. Garland 1991; Croasmun and Ostrom 2011) have specified that the use of a neutral point affects the data. Wilson (1995) recommends introducing a neutral point among the possible choices. However, Brée (1991) advised against such use with children (7 to 12 years old) because, in his opinion, the introduction of a central point on a symmetrical bipolar scale intellectually disrupts children. Adelson and McCoach (2010) find, on the contrary, that young people are not troubled by a 5-point format (with a neutral point) compared to a 4-point format (without one). For Cox (1980), an odd number of choices is preferable to an even number in situations where respondents tend to adopt a neutral position. Weems and Onwuegbuzie (2001) found that frequent use of the central point by respondents reduces the reliability of the measure. In the same vein, Voss et al. (2000) also found this tendency to affect the reliability of a scale; they pointed out that reliability differs depending on the scale used (even or odd), but did not say which is better, stating that they are uncertain whether an odd scale artificially inflates the value of Cronbach’s alpha or vice versa. The following example illustrates the two types of format (even and odd) for a Likert scale: an even, 4-point format offers strongly disagree / disagree / agree / strongly agree, while an odd, 5-point format adds a central neither agree nor disagree position.

Faced with these contrasting results, should a central point be included in the scale and therefore have an odd number of response modalities, or would it be desirable to force the respondent to provide a response using a symmetrical scale with an even number of responses, in order to prevent them from resorting to the neutral category to hide their opinion? The answer to this question is dictated in part by the studied population and the researcher’s objectives. When some respondents may not have opinions to express on the subject, an odd number is preferred. However, if the researcher has reason to believe (but must clearly justify) that some respondents could potentially hide their true opinions (for example for a more “taboo” subject) and choose the neutral mode to avoid disclosing them, then an even number is to be favored. Finally, it should be noted that the purpose of the measure may favor an even or odd number of choices: for example, measuring the roles played by family members (parent or child) in a purchasing decision easily lends itself to a central position that conveys the idea of joint roles between members as opposed to separate roles taken at the initiative of one or the other.

5.2.2.3. The optimal number of points: resolution perspective

Regardless of the perspective used (that of the researcher or respondent), to choose the number of modalities, the benefits of a specific number are often thwarted by inherent limitations. This section will attempt to shed some light on this point by examining two main aspects: (1) the number of response modalities in a questionnaire, and (2) the criteria for choosing the number of response modalities in a scale.

5.2.2.3.1. Number of response modalities in a questionnaire

Whatever the decision ultimately made, it is desirable to harmonize the number of points across the scales within the same questionnaire when doing so does not alter the understanding of the items or the content of a scale. Indeed, it is not uncommon to observe different response formats (in terms of the number of points) in the same survey and among the same respondents, particularly when establishing nomological validity. This is likely to generate many difficulties:

  • – it could create confusion in the respondent’s mind, probably leading to answers that are not in line with their attitudes if they are forced, when expressing their choice, to change reference frames because of a varying number of response options. The respondent’s cognitive adaptive skills may not allow such mental recalibration when reflecting on the content of a different item presented in a different format;
  • – it could generate data with different statistical characteristics, in terms of metric or non-metric properties. Although some statistics are able to process variables that have been captured on different formats in terms of the number of responses, such as structural equation methods (Hair et al. 2009, p. 673), other methods do not allow for this flexibility. Using different measures in terms of the number of points, for example to test dependencies between two constructs when checking the predictive validity of the scale, can lead to inconclusive or even inconsistent results.
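When formats with different numbers of points cannot be avoided, one workaround sometimes used in practice (a sketch under our own naming, not a method discussed above) is to rescale every score to a common 0-100 "percent of maximum possible" metric before comparing or pooling items:

```python
def to_pomp(score, k, low=1):
    """Rescale a score from a low..k response format to a 0-100
    percent-of-maximum metric, so differently formatted items can
    be compared on the same range."""
    return 100.0 * (score - low) / (k - low)

# A 4 on a 5-point scale and an 8 on a 10-point scale land close together:
to_pomp(4, 5)    # 75.0
to_pomp(8, 10)   # about 77.8
```

Rescaling harmonizes ranges for the statistics, but it does not remove the cognitive differences between formats discussed above.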

5.2.2.3.2. Criteria for choosing the number of response modalities on a scale

The choice of a number of response points must be justified. To this end, efforts must be made to take into account the specificities of the research undertaken. Indeed, the number of modalities can only be decided after examining other aspects of the research design:

  • the number of items retained: the larger the number, the more desirable it would be to reduce the number of points in order to facilitate responses, reduce response time and reduce the non-response rate;
  • the characteristics of the population, under two criteria: first, the availability of respondents. Efforts should be made to reduce response time by choosing a small number of points when the population studied has strong constraints in terms of availability. Second, the cognitive abilities of the respondents. The more developed the individual’s cognitive abilities, the easier it is to exploit a large number of points;
  • the statistical tests envisaged: some statistical tests are more demanding in terms of the number of response options. It is therefore necessary to decide, before collecting the data, which statistics to use in order to retain an admissible number of points;
  • the type of research undertaken: during exploratory research (versus confirmatory), the choice of the number of points is more flexible. Since the objective is to test the scale on the response format and its consequences on the quality of the responses and also on the ease of completing the questionnaire, the use of one format rather than another is more tolerated, of course with regard to the statistical tests in view. Moreover, it is even useful to test several formats in order to decide which one is the most appropriate to choose later;
  • the mode of administration envisaged: it is easier to assist the respondent and reduce the ambiguity that could be generated by a large number of possible responses in the face-to-face mode of administration. However, the place where the questioning occurs may require a reduction in response time. For example, face-to-face administration in a public place may involve a reduction in the number of points to retain, as respondents are in a hurry, tired or even in an uncomfortable environment. Similarly, for the traditional telephone method, it is strongly recommended to reduce the number of points. Since the respondent does not have the opportunity to be in contact with the survey material (questionnaire), it is more difficult for them to understand and visualize the proposed response continuum. Similarly, for self-administration via a smartphone, De Bruijne and Wijnant (2014) note that the visibility of the number of points for respondents is very low with a format of 11 points, compared to a format of 5 or 7 points;
  • the nature of the phenomenon to be measured: some phenomena are more abstract than others, requiring not only a diversity of indicators, but also probably more variation in response possibilities covering all expressions of perceptions. When the construct is well-established, a reduced choice format is possible, or even a binary format (example: yes or no) instead of an interval scale. For other constructs, this is not possible and a multi-modality interval scale makes it possible to nuance the responses in order to better match perceptions. Indeed, a respondent may not just agree or disagree (yes/no), but rather have a varying degree of agreement or disagreement with an idea, item, opinion, etc.;
  • the homogeneity of the items: the closer (more homogeneous) the items appear to be, the more likely they are to be perceived in an undifferentiated way; a larger number of points may then be worthwhile to introduce more variability into the responses and avoid redundancy. If the items are clearly distinct (heterogeneous), fewer points can be retained;
  • the arrangement and presentation of the scale: apart from all the above-mentioned aspects, the layout of the scale on certain media, in particular on paper, may in certain situations force a reduction in the number of points. The very presentation of the scale (horizontal or vertical) may or may not allow a particular number of points.

5.3. The verbal components of a measurement scale

Beyond the considerations surrounding the number of items and response choices, there remain many verbal elements (words, labels, formulations) to be decided. The researcher must choose the content most likely to facilitate respondents’ understanding and adherence so that they report their assessments of the phenomenon as accurately as possible. In this section, we will examine the verbal components of a scale, by which we mean two main aspects: the formulation of the items and the labels (if any) associated with the response modalities. These two components actively shape the design of a scale and largely determine the operational quality of a measure. In the first section, we will examine different item formulation formats and some of their consequences in order to identify a set of practical recommendations. In the second section, we will focus on the labels of the response categories offered to respondents for expressing their attitudes toward the items.

5.3.1. The formulation of the items

A critical look at the various results of the discussions undertaken on this subject is of great importance because, first, it makes it possible to highlight observations about the formulations of items to be retained, and second to formulate some recommendations so as to arbitrate the choice of formulations.

5.3.1.1. Research findings and a critical look at the formulation of items

Apart from the need for clarity and precision when choosing indicators for a construct, the formulation of the items has intrigued many researchers, who are often confronted with the problem of whether or not to integrate reversed items. A major decision is required: should negative (reversed) and positive items be included in a scale? The answer to this question has been divisive, and agreement on the need (or not) for such use seems to be lacking. Each format has potential benefits but also many limitations. As such, using both formats on the same scale seems useful in order to allow better discrimination of items while avoiding some response bias. Response bias generally refers to responses that do not reflect the opinions of respondents, disrupting the validity of a measure. It concerns in particular the tendency of an individual to systematically respond in the same way to the items on a scale.

This trend gives rise to quite varied styles, such as the tendency to agree, the tendency to disagree, the tendency to use extreme responses, the tendency to use midpoint responses, etc., which may have several causes such as cultural specificities (Bartikowski et al. 2006; Roster et al. 2006; Dolnicar and Grün 2007; De Jong et al. 2008; Peterson et al. 2014) and the sociodemographic characteristics of the respondent (Greenleaf 1992; Peterson et al. 2014). These styles can affect the validity of research findings (Baumgartner and Steenkamp 2001). It is therefore crucial to pay close attention to this type of bias when adopting, adapting and building a measurement scale during the item formulation phase.

In this section, we will report on four main aspects associated with mixed items: (1) the types of item formulation, (2) the problems created by the introduction of reversed items, (3) the causes generating the problems of using mixed items, (4) the solutions recommended to solve the problems of using mixed items.

5.3.1.1.1. Types of item formulation

A variety of item formulations are possible. Weijters and Baumgartner (2012) specify that an item can be formulated either as an affirmation or as a negation of something. Swain et al. (2008) note that since scales are often used to measure the existence of a phenomenon (behavior, attitude, etc.) rather than its absence, most regular items are formulated as affirmations while most reversed items are formulated as negations. Chang (1995) adds that an item can be negative either through an explicit negation (“not”, etc.) or through a grammatically positive form with negative content (example: this building is poorly designed); conversely, a grammatically negative sentence (example: I am not sad) is not necessarily a negative item, since some words and items carry a negative connotation while others carry a positive one. Sonderen et al. (2013) observed that, in general, two strategies can reverse an item: the first is to add a negation (example: not), the second is to use opposite words (example: large versus small). Swain et al. (2008) find that an item can be reversed by reversing its linguistic polarity, that is, by giving it either a positive polarity (an affirmation) or a negative polarity (a negation). Weijters and Baumgartner (2012) add that several types of negation are possible (verb negation, adjective negation, etc.) and that reversed items can be formulated in different ways; as an illustration, they note that a reversed item can be defined as an item whose meaning is opposite to the chosen standard of comparison (construct, item, respondent). Faced with these different potential definitions of a negative (or reversed) item, Chang (1995) notes that a positive or negative label implies a value judgment that is difficult to generalize across time and cultures; he proposes instead the terminology of an item that is coherent or incoherent with the majority of items on a scale.

5.3.1.1.2. Problems created by the introduction of reversed items

More generally, the objective of using mixed items is to alert inattentive respondents to the variations that exist between indicators, in the hope of circumventing some of the above biases. However, this way of proceeding (scale with mixed items) can lead to:

  • – confusion in the minds of respondents (Herche and Engelland 1996), especially when the scale is long (a large number of items);
  • – low interrelationships between reversed items (versus regular items) generating low reliability (Weems and Onwuegbuzie 2001; Roszkowski and Soven 2010; Salazar 2015);
  • – inflated averages due to items with reversed scores (Alexandrov 2010);
  • – a complex, artificial, or even biased exploratory factor analysis (EFA), with reversed (negative) items, when used, generally loading on a separate axis from the positive items (Herche and Engelland 1996; Spector et al. 1997; Roszkowski and Soven 2010). Schriesheim and Eisenbach (1995) even note that these factor structures for different item formulation formats may vary depending on the rotation chosen (example: varimax, oblimin). According to Salazar (2015), when only positive items are used, the data fit the theoretical factor structure better;
  • – non-equivalent confirmatory factor analyses (CFAs) for different formulations of items on the same scale (Schriesheim and Eisenbach 1995; Zhang et al. 2016).

Thus, some believe that the positive (regular, non-reversed) item format is superior to other formats, as differently formulated items are unlikely to yield consistent information (Schriesheim and Eisenbach 1995; Weems et al. 2003). Reversed (negative) items, which are supposed to act as a cognitive brake in order to better control responses, instead damage measurement. According to Sonderen et al. (2013), using mixed items is counterproductive because it increases confusion and cognitive fatigue, and ultimately leads to more response bias without any significant benefit.
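The reliability damage described above is easy to reproduce: before reverse-scoring, a reversed item correlates negatively with its regular neighbors and drags Cronbach’s alpha down. A minimal pure-Python sketch, using hypothetical 5-point Likert data (the item names and values are invented for illustration):

```python
# Sketch: the effect of reverse-scoring on Cronbach's alpha.
# Hypothetical 5-point Likert data; "i3_rev" is a reversed item
# (e.g. "I am sad" among "I am happy"-type items), so its raw scores
# run opposite to the regular items.

def cronbach_alpha(items):
    """items: list of columns, one list of scores per item."""
    k = len(items)
    n = len(items[0])

    def var(xs):  # population variance
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / len(xs)

    totals = [sum(col[i] for col in items) for i in range(n)]
    return k / (k - 1) * (1 - sum(var(col) for col in items) / var(totals))

def reverse(col, points=5):
    """Recode a reversed item on a 1..points scale: 5 -> 1, 4 -> 2, ..."""
    return [points + 1 - x for x in col]

i1 = [5, 4, 4, 2, 1, 5, 3, 2]
i2 = [4, 5, 4, 1, 2, 5, 3, 1]
i3_rev = [1, 2, 1, 4, 5, 1, 3, 4]   # raw scores of the reversed item

alpha_raw = cronbach_alpha([i1, i2, i3_rev])          # negative: unusable
alpha_recoded = cronbach_alpha([i1, i2, reverse(i3_rev)])  # high
print(alpha_raw, alpha_recoded)
```

On these invented data, alpha is negative before recoding and well above 0.9 after, which illustrates why the “recoding” step discussed later cannot be skipped — though, as the text stresses, recoding alone does not cure all problems created by mixed items.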

5.3.1.1.3. Causes generating problems with the use of mixed items

Attempts were made to understand the causes of problems when using mixed items (positive and reversed). As an indication, it would seem that:

  • – mixed items are a serious problem when used in different cultures, where differences in values, languages and customs often lead to different interpretations of item types (positive or negative) (Wong et al. 2003). Chang (1995) specifies, in this regard, that the connotation (positive or negative) of an item depends on the context (temporal and cultural) in which the item is used. Problems with the translation and equivalence of the words used in the items can also lead to confusion. In some cultures, the confusion generated by mixed items thus seems to create measurement-equivalence problems that undermine construct validity and prevent successful intercultural comparisons (Wong et al. 2003);
  • – the positive and negative items are not really opposed (Barnette 2000; Alexandrov 2010), and this can generate other problems (dimensionality, reliability, etc.);
  • – mixed items are more problematic for some samples because of some of the respondents’ individual characteristics such as income, ethnicity, gender, age (Weems et al. 2003) and education level (Chen et al. 2007);
  • – reversed (mixed) items make the respondent’s task more complex and require more elaborate cognitive processing on their part to compare items to their behaviors and attitudes (Swain et al. 2008). Negative items require a longer cognitive process: they are read more slowly and reread more often than positive items (Kamoen et al. 2017). Moreover, some respondents may not even notice that some items are worded in the opposite direction to others (Schmitt and Stults 1985);
  • – some types of item inversion can cause problems, notably comprehension problems, in particular double negatives (Ahlawat 1985).

5.3.1.1.4. Recommended solutions to solve the problems of using mixed items

Is the validity of a scale improved:

  • – by using a mix of positively and negatively (reversed) formulated items, probably leading to better coverage of the construct and good discrimination of responses, while reducing the automatic responding that introduces response biases?
  • – or by excluding reversed items, thus favoring a simple factor structure, that is, one that is unambiguous and more reliable?

The reflection on the use (or not) of reversed items has been extended to the question of the number of items of each type (negative and positive) to retain in a scale. For single-item Likert measurements, Alexandrov (2010) notes the superiority of the positive formulation. To avoid the problems associated with mixing the two formats in multi-item scales, some recommend an equal number of positive (regular) and reversed (negative) items. Comprehension biases seem to be better controlled when the scale is balanced; an unequal number is problematic (Roszkowski and Soven 2010). However, Chen et al. (2007), although they used a scale balanced in the number of items of each category (4 positive and 4 negative items), still found two factors following a principal components analysis, each of which included either the positive or the negative items. Dickson and Albaum (1975) find that the responses for three formats of a semantic differential scale with different polarities (positive, negative and balanced) are not significantly different.

Other lines of reasoning, exploring the importance of mixed items, have revealed that the problem of integrating (or not) reversed items probably lies elsewhere, more particularly in the multiplicity of possible item formulations. Several types of negation and inversion are possible, and an item can be reversed without containing a negation and vice versa. Each type of item can have different consequences on the answers obtained. Weijters and Baumgartner (2012) suggest that negations should be used sparingly; according to them, there is little benefit in using negations that do not produce a reversed item. Roszkowski and Soven (2010) share this view, as they believe that negative items cause confusion in the minds of respondents, so they are only useful when respondents are inattentive. According to Chang (1995), items with different connotations (coherent/incoherent) do not necessarily measure the same construct. He suggests only using items with connotations consistent with the rest of the items on a scale and with the construct being measured, even if the objective is to circumvent response biases.

On the other hand, for Weijters and Baumgartner (2012), the use of reversed (but not negative) items has the particular advantage of encouraging better coverage of the domain of the construct. Although reversed (oppositely worded) items probably produce artificial factors, their elimination is not always necessary (Spector et al. 1997) and may mask inconsistent responses; they should therefore be used with caution (Weijters and Baumgartner 2012). Hartley (2013) points out, on the basis of a body of work, three particular findings: first, it is difficult to write exactly equivalent terms in positive and negative form; second, respondents have difficulty reversing their thinking (understanding) in order to assess the items; third, different ratings are given for the positive and negative versions of the items. For Hartley (2013), two options are then possible: remove negative items from a scale, or present the results for such items separately.

5.3.1.2. Arbitrating the choice of item formulation

From the controversy over the need (or not) to use a mix of items, it is clear that some endorse such a use for its ability to absorb certain biases, while others contend that it is merely a decoy and that such a mix can promote other biases.

We believe that, in any case, reversed items will probably always exist according to the standard of understanding chosen, whether this is to clarify certain phenomena (for example, positive and negative emotions) or whether it is in the minds of some respondents who have points of view opposed to the idea proposed by the item. In addition, the items (regular and reversed) may take different formulations for the same idea:

  • – positive: “I am happy”;
  • – positive with reversed polarity: “I am sad”;
  • – negative: “I’m not sad”;
  • – negative with reversed polarity: “I’m not happy”.

Negative formulations seem to be the least successful and quite often deserve to be discarded. Nevertheless, each format can have many consequences (positive and negative). These can be grouped according to several logics (statistical, conceptual, etc.). The concern, therefore, in our view, is not whether or not to use a particular type of item, but rather to focus on the merits that a type of formulation could bring to a scale. The objective of the final choice is to promote responses that reflect both respondents’ perceptions and the latent constructs to be measured. In other words, while fitting as closely as possible to the field of the operationalized construct, the measure must provide as accurately as possible the respondents’ position on the dimensions of the construct while minimizing any forms of bias.

Whatever the formulation of the items in a measurement scale, it is imperative to carry out a set of preliminary examinations:

  • verify the respondents’ answers. This recommendation is traditionally made, but is often omitted when coding responses. Examining the responses, particularly when they are entered, can effectively help identify inconsistent responses to be eliminated from the sample of respondents. Indeed, is it possible to “strongly agree” and “strongly disagree” with the same idea formulated in different items? Beyond the recoding that it is wise to undertake when certain items are reversed, it is logical to expect the same respondent to take the same or a similar position (example: degree of satisfaction) on the same evaluation object. An inconsistency in responses probably reflects that the respondent did not pay much attention to the content of the items and responded without real involvement. It is then preferable to remove that individual’s responses from the final sample. The omission of such a verification step is often accentuated when the coding is done automatically, in particular by a third party, far from the researcher’s eyes;
  • check the psychometric qualities according to the formulation of the items. Note that through this recommendation, we suggest paying attention to these properties when choosing a formulation for the items. The “recoding” operation (inverting the scores of reversed items), practiced for statistical purposes when a scale includes mixed items, cannot by itself solve the problems associated with the items. Indeed, weak psychometric qualities are not exclusively caused by superfluous items (and/or dimensions) that should be removed from the scale; faulty psychometric indicators may also be due to inappropriate formulations of the items in question. At the beginning of the scale development process, the researcher sometimes wishes to retain more or less ambiguous items for exploration purposes and often hesitates about which labels to use and which formulations to retain. Even if experts can shed light on the choice, the initial vagueness may persist;
  • check whether the structure of the scale obtained, through the formulation of the items, is in conformity with the conceptual framework defining the construct and clearly delimiting its boundaries compared to other close constructs. In this respect, a factor structure can only be considered artificial if an explanation of the significance and importance of some of the dimensions identified is not possible. Indeed, an artificial structure can be generated by mixed items, but also by regular items and vice versa. In addition, an unplanned structure can actually represent a phenomenon (construct) with several facets (dimensions). The examination of the structure obtained requires great attention to the content of the items and theoretical debates around the phenomenon to be measured. Moreover, this review potentially requires the use of experts (respondents, researchers, etc.) in order to allow a more detailed reading of the results;
  • check with the study population to see if individual differences (cultures, socio-demographic characteristics, etc.) could lead respondents to react differently to certain types of item wording. This would identify those that are likely to generate response bias (e.g. confusion in understanding an item caused by different educational levels of respondents) and those that could be perceived in equivalent ways by all categories of respondents. Indeed, since the purpose of a scale is to be able to quantify a phenomenon (construct) in order to study the individual differences related to it, it is important that the population studied can first have the same understanding of the phenomenon as explained through its indicators. In this case, the wording used could more accurately examine the real differences between individuals about the construct under study. Verifying the meaning of the items, according to the wording envisaged or even different formulations, in relation to respondents, is likely to remove (or at least reduce) the doubt about the confusion and difficulties that an item can create, preventing them from providing answers in line with their experiences and opinions;
  • in case mixed formulations are used when developing a scale, it is useful to verify through explorations with experts (and/or individuals from the population) that the items formulated in one direction (example: negative) are equivalent to those formulated in the other direction (example: positive) and represent many aspects of the phenomenon concerned. This is likely to reinforce the face validity of the items listed. Indeed, it happens that different formulations of an item, supposed to reflect the same idea, can be perceived differently and therefore change meaning simply by changing the statement;
  • check that the interpretation of items, according to their formulations, is not affected by other design variants of a scale, in particular its length (number of items), the number of items in each formulation category (example: number of items with positive connotations and number of items with negative connotations), the order of the items, the number of response points, the labels used for the response modalities, etc. All these aspects can effectively affect the equivalence of items on a scale.
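The first check above, verifying respondents’ answers, can be partly automated. As an illustrative sketch (the item names, tolerance threshold and data are hypothetical), a respondent who agrees strongly with both an item and its reversed twin can be flagged for removal:

```python
# Sketch: flag respondents whose answers to an item and its reversed twin
# are inconsistent (e.g. "strongly agree" with both "I am happy" and
# "I am sad"). Pairs of (regular, reversed) item keys are hypothetical.

def inconsistent(responses, pairs, points=5, tolerance=1):
    """Return indices of respondents for whom, on any (regular, reversed)
    item pair, |regular - reverse-scored reversed| exceeds `tolerance`."""
    flagged = set()
    for i, r in enumerate(responses):
        for reg, rev in pairs:
            recoded = points + 1 - r[rev]  # reverse-score the reversed item
            if abs(r[reg] - recoded) > tolerance:
                flagged.add(i)
    return sorted(flagged)

answers = [
    {"happy": 5, "sad": 1},   # consistent: agrees happy, disagrees sad
    {"happy": 5, "sad": 5},   # inconsistent: strongly agrees with both
    {"happy": 3, "sad": 3},   # consistent midpoint
]
print(inconsistent(answers, pairs=[("happy", "sad")]))  # → [1]
```

Such a filter only catches gross contradictions; as the text notes, the judgment of whether a discrepancy reflects inattention or a genuinely nuanced opinion remains the researcher’s.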

These various recommendations should not be seen as a succession of fixed steps, but as a set of warnings. Moreover, they depend largely on the context of the study, the construct to be operationalized, and can be mobilized in different ways. It is sometimes possible, during the same test, to collect information in order to verify both the understanding of item formulations and to statistically test the structure of a scale or even test other variables of the scale design (number of items, response modalities, etc.). In addition, in intercultural investigations, additional checks on the equivalence of words, labels, facets of behavior, etc., are crucial. These recommendations are to be used as a guide for adjusting the formulations of the items to be gathered.

5.3.2. Label of the response modalities

The words used in the response modalities may be equally important to consider. Each label may indeed be perceived or understood differently by respondents, particularly in different language contexts, favoring certain categories of responses rather than others. Generally, items are evaluated on the basis of response labels composed of positive and negative assessments, which are supposed to be opposed. As such, Rozin et al. (2010) note that some positive words do not have opposing negative words in some cultures and languages. Bartikowski et al. (2006) point out that a word translated into another language may be lexically equivalent, but may have a different meaning. Seeking equivalence in the formulation of items without regarding response modalities can have serious consequences on data quality. It should be noted that the considerations relating to the labels to be used are not exclusively related to intercultural research projects. Labels may also give rise to different interpretations by respondents within the same cultural context.

Although some researchers, such as Cox (1980) and Wildt and Mazis (1978), long ago suggested that the chosen labels have effects on responses, this subject has remained relatively neglected in favor of other aspects of scale design. However, in order to assess the idea underlying an item, the respondent relies heavily on the proposed response options. Lam and Stevens (1994) noted in this regard that the content of items should not be examined apart from the response labels; they observed that different item-label formats can lead to different results. If the labels, being associated with all the items on a scale, are ambiguous, inappropriate, etc., they can cause serious problems affecting the representation of the construct under study.

Among the few studies undertaken on this subject, Revilla (2015, p. 235) found that the use of labels mentioning “higher frequencies or durations, increases the proportion of respondents reporting higher frequencies or higher time spent on the corresponding activities”. In addition, Weijters et al. (2013) analyzed, in different multilingual cultural contexts, the effects of two aspects associated with a label: intensity (extreme responses) and familiarity (use in the language). They found that the choice of labels has consequences on responses: on the one hand, the higher the perceived intensity of a label, the more extreme the associated response; on the other hand, the more familiar the labels are to respondents, the more frequently they are selected. This leaves little doubt that clear labels, conveying the same meaning to all respondents, should be used.

It should also be noted that it is worth questioning the label of the central point when the scale has an odd number of response modalities. For a Likert-type format, for example, should the label neutral, don’t know, indifferent or neither agree nor disagree be used? Does each of these formulations lead to different perceptions? Although the neutral and neither agree nor disagree options seem to refer to the same level of agreement, don’t know or indifferent probably signal a lack of opinion. And although some researchers have observed no significant differences between certain central labels, such as Armstrong (1987), who compared the labels neutral and undecided, it remains crucial to consider whether different labels can be perceived and interpreted differently.

5.4. The arrangement of the design components of a scale

The verbal content of the items, their number, etc., are not the only supports on which respondents base their answers. The visual language and the layout of the different response options can be decisive. Even with properly formulated items and sufficient, understandable response modalities, a poorly adjusted configuration and marking of the choices available to respondents can hinder responses. First, we will examine some of the non-verbal attributes involved in the design of a scale, attributes that can at least help to obtain reliable and valid answers on the phenomena to be measured. Second, we will point out the value of examining them.

5.4.1. Marking the scale attributes: some results

The items composing a scale can be evaluated using a multitude of formats: Likert, semantic differential, etc., as we pointed out in the first chapter. These formats include, among others, labels and/or numbers. That said, labels such as strongly agree, strongly disagree, ratings such as “1”, “5”, “-2”, “+2”, can be considered. Which position (right or left) to choose, then, for each of the labels and numbers selected? Should labels be associated with all the possible answers proposed or exclusively presented at the extreme points of a scale? Similarly, should numbering accompany all or some of the labels? It is easy to observe that these kinds of questions do not receive much attention. Nevertheless, research has shown that they are factors that can modify respondents’ appreciation of the indicators (items). In the following, we will discuss some of these attributes, in particular considerations of order, association and visual presentation.

5.4.1.1. Position of numerical and verbal attributes of a scale

For a long time, the importance of the position of response labels in the expression of individuals’ choices on a scale has been noted (e.g. Belson 1966; Wildt and Mazis 1978). Hartley and Betts (2010) and Betts and Hartley (2012) found that different arrangements of the position of positive labels on a (Likert) scale and the associated numerical values produce different effects. More specifically, they observed that with positive labels and high scores on the left of a scale, the scores obtained are significantly higher than those resulting from other combinations. Betts and Hartley (2012) identify several possible explanations for this, noting more specifically a visual attention bias toward the left, consistent with the usual left-to-right reading direction of native English speakers. Chan (1991), using a Likert scale to present positive (describes me very well) and negative (does not describe me well) response labels on the left or right, noticed different means and estimates of the latent trait depending on the formats tested. More specifically, he found that the positive-on-the-left format yielded a better estimate of the trait.

But can this order effect vary across cultures? For example, in Arabic-speaking countries where reading is done from right to left, does a positive format on the right become necessary? For other countries, such as Japan or China, where reading can take place in other directions, for example, from top to bottom, can the previous horizontal presentation format (left or right) be relevant to the evaluation of an item? In addition, it is often possible to observe, when administering a scale, a presentation of items in two or more languages (e.g. Arabic, French, English) in order to reach a larger sample of the population. Should the items, response labels (positive or negative) and their associated numbers (high or low) be adjusted on the right or left? Although it is probably easier to choose the position of these variants when it comes to the use of languages with the same reading direction, the choice seems more complicated when it comes to several languages written in opposite directions.

Similarly, the position of the numerical values assigned to the different response modalities may be assessed differently from one group of respondents to the next. This evaluation can also be significantly affected by the construct to be measured. Indeed, some phenomena, independently of the items supposed to capture them and of their formulation, have an intrinsically negative connotation, while others have an intrinsically positive one. Respondents may expect to find numberings (high or low) and response labels (positive or negative) on the left or right, depending on their perception of what best corresponds to the expression of their attitudes.

5.4.1.2. Association of labels and numbering with response points

Apart from considerations of order, the numbers associated with the response modalities may also have different meanings depending on the respondents. Schwarz (1999) points out that a respondent’s interpretation of a scale may indeed vary depending on the meaning given to the association between a response label and a number. Schwarz et al. (1991), having retained labeled scales only at the ends, note that responses vary according to the numerical values assigned to the extreme points of a scale. Having compared the results obtained using a semantic differential scale in two formats: one with exclusively labels and the other with exclusively numberings, Dolch (1980) found that the two formats give rise to different factor structures, although the correlations between the items in the two formats are high.

Differences in perception can also be recorded between a format where only the extremities are labeled and one where all response modalities are labeled, since the presence of labels only at the extreme points probably forces the respondent to undertake more elaborate cognitive processing in order to infer the meaning of the other points (Lantz 2013). Weng (2004) notes a trend towards more reliability (test-retest procedure) when all points are labeled compared to labels only at the ends of the scale. Similarly, Eutsler and Lang (2015) found that a scale where all modalities are labeled, compared to a scale labeled only at the ends, minimizes response bias, maximizes variance and lessens measurement error. Moors et al. (2014) recorded variations in extreme response styles by format, with respondents using extreme responses more frequently when response modalities are labeled exclusively at the ends (versus at all points). It appears, a priori, that clearly formulated labels for the response modalities lead to more stability in the understanding of the items to be evaluated. In addition, the presence of a label and a number for each modality seems to offer respondents a valuable guide in their quest to understand and evaluate the items.

Moors et al. (2014) point out that bipolar numbering (example: from “-3” to “+3”), instead of positive numbering (example: from “1” to “7”), can be difficult to use for scales evaluating agreement, since negative values can be confusing. Positive values (example: 1; 5) can mean the absence or presence of a phenomenon, while a mix of positive and negative values (example: -2; +2) can mean opposite poles (Schwarz et al. 1991). But this is, of course, not always the case: positive values (1; 5) can sometimes capture aspects that are opposites.

For example, to measure the influence of the husband and the wife in the couple’s purchasing decision, the number “1” can be associated with the dominant role of the wife, while the number “5” can be associated with the dominant role of the husband (or vice versa), thus, two positive numbers that reflect two opposite directions in the understanding of the distribution of roles within the couple.
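Whichever displayed numbering is chosen, the underlying codes are interchangeable for analysis: a bipolar -3..+3 display and a positive 1..7 display of the same 7-point scale differ only by a fixed offset. A small illustrative sketch (the function names are ours), which also shows why the display choice is purely a presentation matter:

```python
# Sketch: the displayed numbering (bipolar -3..+3 vs positive 1..7) is a
# presentation choice; for analysis the two codings differ by a fixed offset.

def bipolar_to_positive(x, span=3):
    """Map a bipolar code in -span..+span to 1..(2*span + 1)."""
    return x + span + 1

def positive_to_bipolar(x, span=3):
    """Inverse mapping: 1..(2*span + 1) back to -span..+span."""
    return x - span - 1

print([bipolar_to_positive(x) for x in range(-3, 4)])  # → [1, 2, 3, 4, 5, 6, 7]
```

Since the two codings are statistically equivalent, any observed difference in responses between the two displays reflects the perceptual effects discussed here (e.g. a negative number read as the absence of the phenomenon), not the arithmetic.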

In addition, the nature of the phenomenon to be measured may facilitate or hinder the use of one format over another. Respondents’ expression on items related to the construct of happiness, for example, is probably facilitated by an assessment grid (labels and numberings) combining positive labels with positive numbers (example: the label “strongly agree” with the score “5”), rather than associating a positive label (“strongly agree”) with a negative number (“-2”). Indeed, the latter case (positive label, negative number) could express, consciously or unconsciously for the respondent, a lack of happiness. The respondent could, perhaps for reasons of social desirability, have difficulty expressing themselves because of the confusion that this association could generate.

It should be noted that even the chosen response format (example: Likert or semantic differential) can constrain or facilitate the choice of numbering. By way of illustration, the semantic differential scale approaches the phenomenon with reference to opposite poles; it then seems easier to use mixed (positive and negative) values that express this bipolarity more explicitly. Assigning a score of “+2” to the adjective performing and a score of “-2” to the adjective non-performing does not then necessarily hinder the respondent’s understanding.

5.4.1.3. Other visual attributes of a scale

Other aspects can be decisive in the architecture of a scale and the arrangement of its different components. For example: images accompanying verbal (and/or numeral) descriptions of response modalities, symbols (boxes, hyphens, etc.) marking response choices, presentation of the scale on the page (example: drawn in portrait or landscape mode, one or more items per page, a vertical or horizontal presentation of response modalities), etc. These elements significantly contribute to the visual appeal of the scale. Although studies on these types of attributes are relatively few in number, Tourangeau et al. (2004) observed that respondents’ interpretation of items is influenced by visual presentation attributes such as the spacing between response options, the order of response options and the grouping of items. They found that these attributes affect respondent choice and response time.

Visual attributes seem more relevant during self-administration, especially via the web. In this context, Van Schaik and Ling (2007), having tested different scale formats, observed that each can have certain advantages and disadvantages according to a multitude of criteria (e.g. respondents’ preferences, spontaneity of responses, response time), even if the psychometric qualities resulting from the different formats can be comparable. Deutskens et al. (2004) observed that a long questionnaire with images (versus text only) reduces the response rate; this is probably due, according to them, to the time it takes to download the photos, which leads respondents to abandon the questionnaire. However, they noted that with the visual mode, respondents use the don’t know response option less; once the subject of the question is visualized, respondents are probably better able to report their opinions. Are photos, animations, colors, etc. assets that guide respondents, or will they disturb their understanding of the content and even lead them to abandon the questionnaire?

5.4.2. Points of reference

It is clear that different aspects of the design of a scale can lead to different assessments by respondents. We can even argue that the inconsistent conclusions between different research projects are due, among other things, to the design of the measurement scales and, more particularly, to some of their components (labels, numbering, presentation, etc.). Certainly, other considerations should be taken into account to test the robustness of these conclusions, such as the respondent’s tendency to express themselves in a certain way (positive or negative), their response style (example: a tendency to favor extreme responses), or the mode of administration (face-to-face, telephone, etc.). It is clear that, when conceptually formulated hypothetical links between different phenomena are tested with scales that differ in format (types of labels, visual aspects, etc.), the quality of the results can be affected.

Without dwelling on this point, it should be noted that a questionnaire composed of a set of scales of different constructs with harmonized formats (type of scale, numbering, label-numbering association, position of labels, etc.) would seem to be able to reduce certain biases and improve the quality of the results. Admittedly, this is not always possible, particularly when the objective of the study is actually to test different formats of a scale or when the different constructs to be grasped, particularly during the phase of establishing nomological validity, are not suitable for the same measurement formats. In any case, it is useful to reflect on these aspects in order to optimize all the choices associated with the measurement tool to be used.

The test of a scale should not be limited to checking the understanding of the items, their content or, even better, their psychometric properties, as is often the case. Indeed, different design variants (position, numbering, etc.) affect respondents’ reactions to the items, and some formats may not be appropriate for certain cultures or respondent groups. Care must therefore be taken to choose formats that reduce comprehension and interpretation biases and provide reliable and valid results. To this end, an exploratory study can be used to identify the presentation order most appropriate for respondents, that is, the one that best captures their feelings, opinions and attitudes. Individual or group interviews also allow interviewees to react to the subject under study and let the researcher see how they verbalize their attitudes. Finally, even if it is difficult to control all the components of a scale, knowing the consequences they can generate is very important when interpreting the results obtained with the scale.

5.5. Conclusion

It would be simplistic to rely exclusively on the scale-construction steps described in the previous chapter without paying attention to the specific design of the items: their formulation, the number of points to retain, etc. In this chapter, we have focused on the main components of a scale. Without going back over all the aspects discussed, we would like to insist on the fact that developing a scale does not simply involve a fixed succession of steps consisting of defining, purifying and validating a number of indicators covering a construct. Producing a reliable and valid measure requires much deeper reflection on all the attributes of a scale, and no aspect should be neglected or treated as minor. Indeed, all the components of a scale contribute to building a tool that promotes an understanding of reality as it is experienced or felt by the individuals in whom the researcher is interested.

Although the implications of some of the aspects studied have not yet been fully explored in the literature, it remains essential, when constructing scales, to be aware of the importance of quantitative components (number of items, number of response modalities) and qualitative components (verbal and visual) as determinants that can affect responses and thus the quality of the data obtained in terms of reliability and validity. When choosing the design of a scale, it is important to take the time to check whether certain attributes can generate biases, so as to neutralize them. Of course, responses that reflect real differences between individuals in terms of behaviors and evaluations specific to the construct under study do not constitute bias; those resulting from a measurement protocol inadequate for capturing the reality of a construct, however, often do. Indeed, the measuring instrument itself can lead to inaccurate, confusing or incorrect answers.

5.6. Knowledge tests

  1) What are the main components of a multi-item scale?
  2) How does one determine the number of items on a reflective scale? Does the same reasoning apply to the number of items on a formative scale?
  3) Between an even and odd scale: which is the best to choose? How many response modalities should be used?
  4) What is a balanced scale? When is it appropriate to use it?
  5) What does the notion of a reversed item mean? When is it appropriate to use this type of item?
  6) How does one obtain a semantic differential scale with reversed polarity?
  7) Present a balanced 8-item forced-choice Likert scale to measure attitude towards sales.
  8) Should all the points of a scale be labeled?
  9) How important is it to test a scale?
  10) Give some visual attributes of a scale. How important are they in the development of a scale?
  11) What advice would you give to the developer of a questionnaire?
  1. This involves describing whether a scale is symmetrical in the proposed response choices (as many positive as negative values) or asymmetrical, thus having (in addition to positive and negative values) a clearly identified central (neutral) point.
  2. Since the semantic differential scale format is bipolar in nature (see Chapter 1, section 1.3), the inversion of its polarity is done by manipulating the position (right or left) of positive and negative adjectives (or sentences).