CHAPTER 4

Measuring Public Relations Outcomes

Previous chapters noted that public relations professionals need to understand and be able to measure the outcomes they hope to achieve in their campaign programming. This is not a new topic, yet it has drawn tremendous interest over the past 25 years as the profession has sought to demonstrate effectiveness. Public relations measurement, as noted in Chapter 3, often deals with what academics would call mediating or intermediary variables—things that affect the final business outcome but are not necessarily financial in nature. Organizational, brand, or product credibility is one such variable: something that cannot be seen directly but can be measured and verified as a valid and reliable predictor of final outcome, which for public relations increasingly focuses on nonfinancial indicators. Furthermore, since the beginning of the 21st century, scholars and professionals have pushed for standardized scalar measures to ensure equivalency of data and permit comparative analyses (Michaelson and Stacks 2011). This chapter introduces readers to measurement and evaluation from a social scientific perspective, one that allows the public relations professional to create reliable and valid measures of nonfinancial indicators (see Chapter 3). The chapter differentiates between hard financial indicators and soft nonfinancial indicators. As discussed previously, nonfinancial indicators include, but are not limited to, measures of confidence, credibility, relationship, reputation, and trust, and indicators of awareness, interest and understanding, desire, and intent to adopt (Stacks and Carroll 2004).

Fundamentals of Data and Measurement

What does it mean when someone says that they are measuring something? Does it mean that what is being measured is observable? Does it establish differences or similarities between objects of measurement? Is it something that comes naturally? Is it done to create data that can be used to establish relationships between things? Finally, does it mean anything in particular? All of these are questions that measurement researchers ask daily when they try to create, use, or evaluate measures for any use. Public relations measurement researchers are no different, except that they have entered the process a little later than their promotional communication colleagues.

So, what does it mean when we say we are measuring something? It means that we are establishing a ruler that allows for comparisons and interpretations of “data” obtained from those measures (Stacks 2017). For instance, in the physical world, measurement is fairly straightforward; we often use inches, centimeters, yards, meters, miles, and so forth to measure linear distance (although it is unclear from this whether we are talking of horizontal or vertical distance). What we have established, however, is a metric—“[a] numeric value associated with campaign research demonstrating statistically whether outtake, outcome, or both objectives are being reached” (Stacks and Bowen 2013)—that can be used to describe something: distance or height in this case. However, in the social world we also measure things that are not as observable, but are “potentially observable”—credibility, reputation, and trust, to name but a few. Based on our created nonfinancial measures we can then make decisions as to whether something is longer or shorter, higher in credibility, or lower in trust when compared against something else.

However, notice that the interpretation of the measurement is usually less precise than the measure itself. Measurement interpretation often says something is heavier than something else—and in many instances this can be done with different measures, pounds of personal weight as compared to stones of personal weight. While measurement is more precise and can be tested for its reliability and validity, interpretation is, well, determined with words.

Measurement as a Public Relations Tool

In its simplest form, measurement is an observation. From a business perspective, measurement is most often used to track financially related variables. These variables include number of units produced, stock prices, gross and net profits, and so forth. Their measurement is quite precise since the data are hard; that is, they are directly observable and can be counted. Marketing can tell how many products are sold per 1-800 phone call and can calculate the cost per call across a number of other hard data points—number of staff hours, returns, and so forth (see Chapter 3). Human resources can provide a breakdown of cost per unit by employee and employee costs (wages and benefits, for instance).

From a public relations perspective, measurement is less precise because the data are soft. Soft data are not easily observed, which is one reason why investment in public relations measurement has suffered. Instead of focusing on the mediating variables that affect business outcomes, public relations counted simple indicators of distribution. The “clip book” measured success as the number of releases sent out and picked up in the media; we could count the number of “likes” on Facebook and the number of “Tweets” mentioning a client or brand, note a story’s placement in the media (above or below the fold, page number, presence or absence of an accompanying photograph), or calculate the equivalent cost of comparable advertising (a measure the Commission on Public Relations Measurement and Evaluation has deemed inappropriate and misleading, although many still use it) (Commission on Public Relations Measurement and Evaluation 2009). Importantly, measurement has become central to public relations since the mid-1990s because public relations theory, and strategy driven by that theory, has been pushed hard by public relations academics and professionals with backgrounds in anthropology, communication, psychology, and sociology, who have argued for the mediating impact of public relations and demanded that public relations measurement focus on the effect of nonfinancial variables on bottom-line results. This is not to say that simple counts are invalid; they are just one of a number of measurement tools the professional can use to demonstrate effectiveness, but they are not as precise, nor do they provide the data required to demonstrate impact on bottom lines.

Once it is established that reliable and valid nonfinancial measures can be created, their effectiveness during and after the campaign as related to the financial indicators can be assessed. What this provides the public relations professional is a way to actually establish campaign impact against planned benchmarks, compare public relations effectiveness as related to ancillary marketing and advertising indicators, and at campaign’s end demonstrate impact on final return on investment within the public relations department and the general business as a whole. Stated in terms of public relations effectiveness, measurement is an integral function of the Excellence Pyramid’s basic or proponent level (see pp. 29–32) (Michaelson, Wright, and Stacks 2012).

So, what exactly should public relations measurement be concerned with? Where there are hard data based on financially related measures, they should be gathered, interpreted, and evaluated. Where there are soft data based on nonfinancially related measures, those measures should be created or, if available from academic or business sources, adapted, and then gathered, interpreted, and evaluated. In the end, both sets of data—financial and nonfinancial—should be used to assess impact on public relations and general business goals and objectives. Nonfinancial measures are related to particular publics’ or audiences’ awareness, knowledge, values, beliefs, attitudes, and intended actions—variables that we know from the social sciences impact decision making (Stacks and Salwen 2009; Botan and Hazelton 2006).

Data and Measurement

Measurement is a systematic process of observing and recording those observations as data (Stacks 2011, 45). Stated differently, it is “a way of giving an activity a precise dimension, generally by comparison to some standard; usually done in a quantifiable or numerical manner” (Stacks and Bowen 2013, 18). Data are the observations themselves that are used for comparative or descriptive purposes (Stacks and Bowen 2013, 8). As noted earlier, financial and nonfinancial data are different; while financial data can actually be counted, nonfinancial data must be collected as reflections of an individual’s values, beliefs, and attitudes. The collection of data differs, then, in that nonfinancial data come from measurement developed to assess an individual’s opinions, which reflect his or her inner thinking processes. This is covered in a later section, but for now the basics of data need to be addressed.

Basically, data can be defined as existing in four distinctly different forms or levels. Furthermore, these levels are systematically linked by how they are defined. How particular data are defined influences how they are interpreted and ultimately evaluated. The definitional process begins by establishing whether the data are based on categories (categorical-level data) or fall along a continuum (continuous-level data). Furthermore, in terms of evaluation, data that are defined as continuous are generally more powerful (they provide more information about the observations, in terms of being able to calculate the mean, standard deviation, and variance of observations) than data that are categorical, which can only be reported as simple counts, percentages of category, or simple proportions. Hence, we have two types of data, each of which can be subdivided into two levels.

Categorical Data

At the categorical level we have nominal-level data, data that are defined by simply and systematically naming the observations while making no assessment beyond the names. For instance, there are many different ways that public relations measurement is practiced. Distinguishing between corporate and agency, for instance, is a measurement of public relations practice that simply distinguishes between two of many different practice units. The measure says nothing about which is more important, which is more effective, which is more representative, or which even has more professionals practicing in the area. Traditional public relations measurement dealing with the number of press releases picked up by the media would produce nominal-level data; the measure would tell us how many releases were picked up in various papers, but not the quality of the release or whether the stories were accurate.

As noted earlier, nominal measurement’s role is simply to differentiate between subcategories of some outcome variable. For instance, a public relations campaign may want a simple indication of an attributed message source’s credibility, or that of several potential attributed sources. The simplest and probably most direct measure would be to ask people whether they thought the sources were credible: “In your opinion, is so-and-so a believable source? Yes or No.” Similarly, one of your authors, in conducting public relations research for gubernatorial candidates in the late 1970s, asked survey respondents “Who is the governor of Alabama?” and then checked off the names reported. Each name was equal to the other names; hence, the measurement was nominal. Finally, intent to purchase or recommend purchase is often measured via a simple “Do you intend to purchase or recommend stock in company X? Yes or No.”

If we define our measurement system as doing more than simply distinguishing, by assessing some relative quality—larger or smaller, expensive or cheap, taller or shorter—the measurement system produces ordinal-level data, data that order the observations systematically based on some criterion clearly defined in advance. Thus, we could measure public relations units by the number of clients they had, by their net income, or by the number of employees each had (large, medium, small). The measurement creates an ordered set of data that differ on some preassigned quality. Of importance to ordinal-level data analysis is that there is no overlap between categories. For instance, age is a variable for which it is difficult to get survey responses when respondents are asked, “What is your age in years?” People often refuse or will stretch their responses from the truth. The same is true of income questions. Ordinal measurement provides a way around both problems by establishing a systematic set of categories: age may be under 18; 19 to 25; 26 to 50; 51 to 65; and over 65, and income may be under $10,000; $11,000 to $20,000; $21,000 to $50,000; and over $50,000. Note that the categories are not equal, nor are they intended to be; instead they represent an ordered set of categories that meet some measurement criterion (which could be based on previous research or on an analysis of census data).

Ordinal-level measurement is often called forced choice measurement because respondents must choose a category or be placed in a nonresponsive category by the researcher. Hence, an ordinal measure of credibility would take the form of naming a source and asking respondents whether the source was very believable, believable, or not believable. Should a respondent not make a decision, he or she would be placed in a refused to answer (RTA) category. An ordinal measure of awareness might be phrased, “How aware are you of the current governor of Alabama? Very aware, aware, not aware.” For the intent to purchase measure, an ordinal measure might be stated as, “Company X is offering stock at $xx.xx; how sure are you about purchasing this stock? Will definitely purchase, may purchase, may not purchase, definitely will not purchase.” For both of the final examples, refusal to answer for whatever reason would result in the respondent being placed in the RTA category.
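
To make the distinction concrete, here is a minimal sketch in Python (using the pandas library) of how responses such as those above might be coded as nominal and ordinal data; the variable names and response values are hypothetical illustrations, not items from any actual instrument.

```python
# A minimal sketch of coding survey responses as nominal and ordinal data.
# Column names and response values are hypothetical.
import pandas as pd

responses = pd.DataFrame({
    # Nominal: "Is so-and-so a believable source? Yes or No." (RTA = refused)
    "believable": ["Yes", "No", "Yes", "RTA"],
    # Ordinal: forced-choice awareness categories
    "awareness": ["Very aware", "Aware", "Not aware", "Aware"],
})

# Nominal data: categories with no inherent order; only counts make sense.
responses["believable"] = pd.Categorical(responses["believable"])
print(responses["believable"].value_counts())

# Ordinal data: the categories carry an order defined in advance.
awareness_order = ["Not aware", "Aware", "Very aware"]
responses["awareness"] = pd.Categorical(
    responses["awareness"], categories=awareness_order, ordered=True
)
# Ordered categories support comparisons such as "at least 'Aware'".
print((responses["awareness"] >= "Aware").sum())
```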

Continuous Data

Continuous-level data are found on a continuum, and how that continuum is defined dictates what type of data they are. Continuous data that exist within an interval on the continuum are called interval-level data; it does not matter where within the interval the observation falls, only that it is measured as being in that interval. Hence, the differences between the numbers 1, 2, and 3 are exact units apart (1 is one unit from 2; 3 is two units from 1 and one unit from 2). This will become more important a little later, but for now think of age as the variable being measured. Upon a birthday you are one year older, even though you may only be one day older than you were the day before. If thinking in terms of money, five dollars is an interval measure. A majority of public relations measures produce interval-level data (Stacks 2011).

Public relations use of interval-level measures, as will be expanded on shortly, has been limited, primarily because the profession has employed a marketing approach to measurement that forces respondents to make a definite choice regarding the measurement object. Interval-level measurement requires that the perceived distances between points on the continuum appear to be equal; hence, the forced choice does not allow a respondent to be unsure, uncertain, or undecided. An interval measure would add an arbitrary midpoint among the categories, allowing respondents to be uncertain, and would make RTA a truly nonresponsive choice. Hence, our interval measures for the earlier examples would allow for uncertainty. For our credibility measure the responses would be, “very believable, believable, neither believable nor not believable, not believable, very not believable.” In the same manner, awareness responses would be, “very aware, aware, neither aware nor unaware, unaware, very unaware,” and the intent responses would be, “will definitely purchase, may purchase, may or may not purchase, may not purchase, definitely will not purchase.”

Measures that produce data that can be found anywhere on a continuum that has a true zero (0) point are called ratio-level data. Most hard financial data are ratio in nature: units produced, employee work hours, and gross or net income in dollars and cents. Furthermore, since there is a true zero point (interval data may have a zero point, but where it sits along the continuum is arbitrary), the measure can further be defined in terms of absolute difference from zero; hence, it can be used to measure profit and loss.

The use of ratio-level measures in public relations is even less common than that of interval measures. A ratio measure would ask respondents to make a decision based on where on a continuum they would fall regarding the object of measurement. For instance, we might ask, “On a scale of 0 to 100, where 0 is the complete absence of credibility and 100 is completely credible, where would you place X?” The same would be done for awareness and intent.
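
The contrast between interval and ratio coding can be sketched the same way; the numeric codes and responses below are again hypothetical.

```python
# A minimal sketch contrasting interval and ratio coding for the
# credibility examples above. Codes and responses are hypothetical.
import pandas as pd

# Interval: five ordered response options assumed to be equally spaced,
# coded 1-5; the zero point is arbitrary (there is no "zero credibility" code).
interval_map = {
    "very not believable": 1,
    "not believable": 2,
    "neither believable nor not believable": 3,
    "believable": 4,
    "very believable": 5,
}
answers = pd.Series(["believable", "very believable", "not believable"])
interval_scores = answers.map(interval_map)
print(interval_scores.mean())  # means and standard deviations are meaningful

# Ratio: a 0-100 rating where 0 is the complete absence of credibility,
# so statements such as "twice as credible" are defensible.
ratio_scores = pd.Series([80, 40, 65])
print(ratio_scores.mean(), ratio_scores.max() / ratio_scores.min())
```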

Continuous-level data are found in nonfinancial measures, but primarily at the interval level. Public relations has not required the precision of ratio measures for a number of reasons, but primarily because the measurement of psychological indicators, of which credibility, confidence, relationship, reputation, and trust are commonly used examples, does not require that much precision. Such measures capture what a person thinks, yet they have been correlated with actual behavior, and high correlations are found between what is expressed (opinion) and what is done. Ratio-level measures are also more difficult to administer and require sophisticated statistical knowledge to interpret and evaluate. Furthermore, consider a thermometer, which many would treat as a ratio measure: the zero point (where water freezes) is arbitrary—either 0°C or 32°F. A true ratio-level measure would be the Kelvin scale, where 0 K is equal to –273.15°C or –459.67°F. Such precision is not required in most public relations measurement systems.

Datasets

Before moving on to creating and using measurement, we need to say a little more about the data themselves. As defined earlier, data are observations. Those observations are defined as “variables” that have been operationalized into quantitative metrics that “define” the observations in a systematic and verifiable way. A number of variables and observations are placed in a dataset—a structured or unstructured collection that often takes the form of a spreadsheet, such as Excel (or, as we will discuss later, an SPSS data file). How the dataset is structured and how large it is determines which of the three types of datasets it is: small, large, or big.
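
As an illustration, a small, structured dataset might be laid out as follows; the column names and values are hypothetical.

```python
# A minimal sketch of a small, structured dataset: each row is one
# observation (a respondent), each column an operationalized variable.
import pandas as pd

dataset = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "media_exposure": ["editorial", "advertisement", "control"],  # nominal
    "awareness": [3, 2, 1],                # ordinal code: 3 = very aware
    "credibility_score": [4.2, 3.8, 3.1],  # interval (averaged scale items)
    "intent_0_100": [75, 60, 40],          # ratio (0 = no intent at all)
})
print(dataset.describe(include="all"))

# The same structure can be exported for a statistics package such as SPSS,
# for example as a CSV file the package can import.
dataset.to_csv("small_dataset.csv", index=False)
```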

Small Data

The dataset that most of us think about when we conduct research is a structured relationship between specific variables and specific observations. In small data, variables are analyzed as if they are the only ones in the research. Small data are used primarily to test assumptions, answer specific research questions, or test hypotheses. A small dataset rarely has more than 1,000 observations and limits itself to only those variables the researcher feels are important.

Large Data

Large data are also structured, but a large dataset has many more variables and observations than a small one. Large datasets commonly have 1,000 or more observations and may include as many variables as the researcher believes may be active in the target audience. Large data are used primarily to establish norms against which specific subpopulation findings are tested. Large data are most likely to be “field” data, or data from studies that involve “real world” conditions. By contrast, small data are more likely to be derived from “experimental” studies that take place in controlled environments. Each, as we will discuss in later chapters, has its own advantages and disadvantages.

Big Data

The current fad is to focus on “Big Data.” Big data are a huge dataset, or multiple datasets, with many, many variables and thousands or more observations. Much of these data come from data “warehouses” into which the data have been input and from which they are then analyzed. What makes big data unique is that the dataset is unstructured. Unstructured datasets provide analyses by allowing the computer program computing the analyses to create relationships among variables according to a particular algorithm (a formula).

One of the most common uses of big data is to identify audience or stakeholder segments that have an impact on or can influence outcomes. A segment is a particular part of the dataset that describes customers, brands, and so forth by some differentiation system. For example, segments may be demographic, socioeconomic, or psychographic. Understanding the unique relationships the variables have with each other allows researchers to set priorities or proportions of audiences for each variable.
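
As one hypothetical illustration of how an algorithm can create segments from such variables, the sketch below uses k-means clustering; the chapter does not prescribe any particular algorithm, and the variables, data, and number of clusters are assumptions made purely for the example.

```python
# A hypothetical sketch of algorithmic segmentation: k-means clustering
# groups respondents into segments based on whatever variables are supplied.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
# Hypothetical demographic/psychographic variables for 500 respondents:
# age, household income (in $1,000s), and a 0-100 brand-affinity score.
X = np.column_stack([
    rng.normal(40, 12, 500),
    rng.normal(65, 20, 500),
    rng.uniform(0, 100, 500),
])

# Standardize so no single variable dominates the distance calculation.
X_scaled = StandardScaler().fit_transform(X)

# Ask the algorithm for, say, four segments and inspect their sizes.
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
print(np.bincount(segments))
```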

Distinguishing big data can be done by looking at what have been called the “seven Vs”: volume, velocity, variety, veracity, variability, visualization, and value. Volume refers to the size of the dataset and is usually referenced in terms of bytes of data; big datasets typically are described in terms of petabytes (one quadrillion bytes of data). Velocity deals with the speed at which the data are “captured,” often from online data streams. Variety means that you are not limited to quantifying your data into numbers (e.g., male = 1; female = 2); you can actually use anything for data: GPS coordinates, text, logos, and so forth. Veracity focuses on the trust you place in your data, especially if you are using artificial intelligence (AI) programs to create relationships and new variables from those relationships. Variability concerns how the meaning, quality, and flow of the data change over time. Visualization refers to how you take humongous datasets and demonstrate the relationships visually, as the data coming in may be so varied and fast that it is difficult to make sense of them at any one given time. Finally, value refers to what you plan on doing with the data—it may be that you are testing a strategy or theory about segments of your target audience; in so doing you are defining what you expect to find.

Each of these Vs gives big data an advantage in a fast-moving and Internet-interrelated world. Careful consideration must be taken when deciding (a) what kind of data you are using, (b) what types of data you will be analyzing, and (c) whether the data analysis is historical (i.e., you collect the data and analyze them later) or relational (i.e., you analyze the data as they create new variables from current variables and observations in real time).

Big data are here to stay, and the computers that provide the necessary speed and memory become faster and more powerful every year. You may never use big data, but you should understand what it can and cannot do for you.

Creating and Using Measurement Systems

To better understand measurement and evaluation, let’s try something. In the left margin of this page, put the total number of words contained on the page as a ratio measure. (Why is it ratio? Because it could range from zero—no words—to the actual count you get.) Now count the number of sentences on the page and put that in the left margin as an interval measure. (Why interval? A single word could be a sentence if it has punctuation, so although there might be no sentences, there are still words on this page.) Now count the number of long, medium, and short sentences as an ordinal measure and put it in the left margin. (Why ordinal? Because you will have to come up with a rationale for what constitutes a long, a medium, and a short sentence—say, long would be more than 10 words; medium between 4 and 9 words; short fewer than 4 words.) Finally, count the number of nouns and then pronouns on the page—a nominal measure—and put it in the left margin.
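
For readers who prefer to see the exercise in code, the following hypothetical sketch applies the same counts to a string of text; the sentence-length cut-offs follow the rationale suggested above.

```python
# A hypothetical sketch of the counting exercise applied to a text string.
import re

page_text = (
    "Measurement is an observation. It establishes a ruler for comparison. "
    "Soft data are harder to observe than hard data."
)

words = page_text.split()
word_count = len(words)                      # ratio: could be zero words

sentences = [s for s in re.split(r"[.!?]+", page_text) if s.strip()]
sentence_count = len(sentences)              # treated here as interval

def length_category(sentence):
    # Ordinal: long / medium / short, defined in advance.
    n = len(sentence.split())
    if n > 10:
        return "long"
    if n >= 4:
        return "medium"
    return "short"

ordinal_counts = {}
for s in sentences:
    cat = length_category(s)
    ordinal_counts[cat] = ordinal_counts.get(cat, 0) + 1

print(word_count, sentence_count, ordinal_counts)
```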

Please turn back to Page 56 and reread that page and then come back to this page. Now run your counts again and put what you observe in the right margin for each measurement level. Did you get the same results? If so, your counting is reliable and probably valid. If you did not get the same results (similar doesn’t count), you’ve just met two of measurement’s problems: reliability and validity. We will return to both in a while, but for now think back on what you did the second time that might have created what we will call measurement error.

Basically, all measurement has some error in it. The job of good measurement is to keep that error as small as possible. This is done in several ways, from diligently creating a measurement system that can be employed by different people at different times and still come up with the same results each time and for each person, to carefully looking at how that system is described or operationalized.

Measuring Public Relations Outcomes

Based on what has been discussed thus far, it should be clear that the public relations outcomes that have direct impact on business goals and objectives are nonfinancial in nature. What this means is that public relations serves as a mediating or influencing factor on the final outcome of any campaign and its programming (outputs aimed at specific opinion leaders or influencers who will then serve as “third-party endorsers” or advocates of the campaign messages). These mediating factors are not readily apparent; they are only potentially observable. What this means is that public relations measures focus on the perceptions that publics and target audiences have of a business, company, brand, individual member, or whatever the object may be.

The problem with mediating variables is that they cannot be directly measured. Unlike financial data such as profits and losses, employee hours, or the number of products produced, public relations variables exist in the minds of those individuals to whom the public relations effort is being targeted. Thus, public relations measurement seeks to understand perceptions of the measurement object’s qualities—perceptions of product or company credibility, reputation, and trust, and the perceived relationship, and confidence in that relationship, that those individuals have with the object of measurement.

So, how does a public relations researcher measure something he or she cannot see? The answer is to do it as any social scientist would. Since the early 1900s, social scientists have measured, with varying degrees of reliability and validity, internal thoughts or perceptions (Thurstone and Chave 1929; Likert 1932; Dillard and Pfau 2002). What we know from this body of research is that behaviors can be inferred from people’s expressions about something. That is, we know that behaviors are influenced by awareness, knowledge, attitudes, beliefs, values, and intents. The problem is that all of these reside in the mind, and although we can observe brain functioning through different forms of activity, we obviously do not have a way to peer into the deep recesses of the mind where awareness, knowledge, attitudes, beliefs, values, and intents reside. What social scientists do is correlate actual behavior with intended behavior—actual behavior being the action taken by an individual and intended behavior being the expressed course of action an individual says he will take. What an individual says he will do is defined as his opinion on some future or past action. We know from anthropological, communication, psychological, and sociological studies that opinions are the expression of attitudes, which are defined as predispositions to act in some way. Attitudes in turn are based on belief systems, which are more primitive and allow for fairly simple internal processing. Beliefs are formed from value systems, which are so basic that most people do not actually think about them unless they are threatened (Dillard and Pfau 2002).

Thus, public relations researchers focus on creating measures that reflect the inner feelings and thoughts of their publics or targeted audiences. Creating such measures requires an understanding of culture, language, and behavioral intention: culture because it tends to create our value systems; language because people express their cognitive and affective responses to something through whatever language is spoken; and behavioral intention because we can express intent. Furthermore, if measured carefully—which means that the measures are (a) measuring what we think they are measuring (are valid) and (b) doing so time and again (are reliable)—they have been demonstrated to reflect actual behavior up to 60 percent of the time (Miller 2002), which is much higher than would be expected from chance alone.

So, public relations researchers assess outcomes based on systems of measurement that will predict behavior. The problem comes in determining whether those measures are valid and reliable. With financial indicators—much like counting words on a page—the researcher can count and recount until he or she is certain that the number is correct. This is not possible with social and nonfinancial measures, as there are a multitude of things that may be operating at different times when measuring them. Add to this the real problem of businesses creating and then not sharing measures due to the proprietary nature of business in general, and it becomes clear that there are many nonfinancial measures and that their creation and validation may not be open to all (hence public relations researchers often have to rely on academics to create basic instruments and then adapt them to the specific problem).

The problem, then, is creating reliable and valid measures of variables that will mediate the intended and then actual behavior of a public or target audience. Social scientists do this by creating attitudinal or belief scales—measures that collect data by asking questions or making statements that respondents answer or react to. Notice that the plural form is used here—questions and statements. When someone responds to a single statement or answers a single question (an item), the researcher cannot be certain that the response is reliable. If its reliability cannot be judged, its validity cannot be established.

Measuring and Validating Nonfinancial, Social Variables

There are many measures found in the social sciences literature that could be adapted to public relations. However, unlike academic research, which can take months if not years to conduct, public relations professionals are rarely allowed much time to prepare and substantiate their measurement efforts. The rise of public relations research and measurement firms—some standalone and some part of larger, multifunctional public relations agencies—has provided some movement toward greater use of measurement. But given the proprietary nature of many of these measures, they are not often shared with other professionals. Reports of findings employing these measures are written and presented, but the actual measures and how they are computed, weighted, or assessed rarely find their way into print. Therefore, it is incumbent on the public relations professional to understand the basics of measuring social and nonfinancial variables. An informed client, after all, is the best client, and being able to participate in establishing a measurement system—even one assembled quickly to collect data on ongoing programs—should produce a better and more targeted measure.

Creating Nonfinancial or Social Measures

As far back as the early 20th century, social scientists have been creating measures of human behavior and measures that predict those behaviors. The earliest of those social scientists created what amount to measures that assess attitudes, beliefs, and values. They did so through the use of language—that is, they understood that measures are dependent on the language people use to communicate and that not all languages have common meanings or phrases that are even similar to those of other languages. These measures are called measurement scales and are composed of items—statements or bipolar word groups—that when added and averaged provide the outcome measure of interest. Since public relations measurement is often done through polls and surveys or through carefully constructed questionnaires, only three of the many approaches to attitude measurement are commonly found.

Equal Appearing Interval Scale. The oldest measurement system was developed by Thurstone and Chave in 1929 (Thurstone and Chave 1929). In this system an attitude or belief object was measured by what they called “equal appearing intervals.” The intervals were actually numeric values of statements created and tested to range from one end of the attitudinal continuum to the other. Hence, items were created and tested on a large sample of people from the population to be measured. Hundreds of statements about the attitude object were created, participants were asked to put each into one of 11 piles—from very unfavorable to very favorable, for instance—and then each statement’s average favorableness and the range of assigned favorableness were computed and examined. From this analysis of hundreds of statements, a large number were determined to be valid and reliable items for a larger scale. A total of 50 to 60 of the items could then be given to participants, who simply indicated which statements they agreed with; the scale values of those statements were then summed and divided by the number agreed with to create a score.
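
A minimal sketch of how such a scale is scored, using hypothetical statements and scale values rather than any published instrument, looks like this:

```python
# A minimal sketch of equal-appearing interval (Thurstone) scoring: each
# retained statement carries a pre-established scale value from the judging
# phase, and a respondent's score is the mean scale value of the statements
# he or she endorses. Statements and values below are hypothetical.
scale_values = {
    "This organization always tells the truth.": 10.2,
    "This organization is usually straightforward.": 7.8,
    "This organization is no better or worse than most.": 5.5,
    "This organization shades the truth when convenient.": 3.1,
    "This organization cannot be believed.": 1.4,
}

# Statements the respondent checked as ones he or she agrees with:
endorsed = [
    "This organization is usually straightforward.",
    "This organization is no better or worse than most.",
]

score = sum(scale_values[s] for s in endorsed) / len(endorsed)
print(round(score, 2))  # 6.65 on the 1-11 favorability continuum
```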

The advantage to Thurstone and Chave’s measure is that the measure has validity for a large number of people based on pre-established values. Postadministration reliability should be high because of all the work done to prepare the scale items. The disadvantage comes from the amount of time it takes to create and validate the measure—something even more daunting in today’s social networking media.

Likert-Like Measures. In 1932 Rensis Likert reported on a different attitude measure that has become a staple of public relations measurement (Likert 1932). Likert’s system employed an odd-numbered response set to a simple statement, from which respondents selected the response category that best represented their feelings on the attitude object. Likert argued that with multiple statements focusing on the same attitude object, an “equal appearing interval” measure could be created that was reliable and valid. Where Likert differed from Thurstone and Chave was in the creation of a midpoint between positive and negative responses. Furthermore, he stressed the need for each category to be a direct opposite of its linguistic partner. For instance, in the classic Likert-like scale, the opposites of “strongly agree” and “agree” are “disagree” and “strongly disagree.” The midpoint is “neither agree nor disagree.” However, the categories could just as easily be “excellent” and “good,” which would be opposed by “bad” and “terrible.” The problem comes with calling the attitude object “terrible”—something most clients would prefer not to know.

To produce truly interval-level data, however, Likert argued that there must be multiple statements, at least two or three, and that they should be stated as degrees of opposition. For instance, the statement “I love brand X” would be opposed by “I hate brand X,” and a middle-ground statement would be “I like brand X.” These would then be randomly placed with other items and presented to a respondent in a paper and pencil measure. If the respondent agreed with the first statement, he or she should disagree with the second and be somewhere in the middle on the third statement. This provides a sort of internal reliability check that can be observed just through the paper and pencil markings.
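
The scoring logic, including the reverse-coding of oppositely worded statements, can be sketched as follows; the items and responses are hypothetical.

```python
# A hypothetical sketch of scoring a Likert-type measure in which some
# statements are worded in the opposite direction (e.g., "I love brand X"
# versus "I hate brand X") and must be reverse-coded before averaging.
import pandas as pd

# 1 = strongly disagree ... 5 = strongly agree, for three hypothetical items.
responses = pd.DataFrame({
    "love_brand_x": [5, 4, 2],   # positively worded
    "like_brand_x": [4, 4, 3],   # positively worded
    "hate_brand_x": [1, 2, 4],   # negatively worded: reverse-code it
})

# Reverse-code the negative item on a 1-5 response set: new = 6 - old.
responses["hate_brand_x_rev"] = 6 - responses["hate_brand_x"]

items = ["love_brand_x", "like_brand_x", "hate_brand_x_rev"]
responses["attitude_score"] = responses[items].mean(axis=1)
print(responses["attitude_score"])

# A quick internal-consistency check: a respondent who strongly agrees with
# "I love brand X" should tend to disagree with "I hate brand X".
print(responses["love_brand_x"].corr(responses["hate_brand_x"]))
```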

Likert-type measures have the advantage that they can be created quite quickly and have demonstrated consistent reliability if created systematically following the validation steps discussed earlier. There are problems with languages other than English in terms of direct translation; the traditional strongly agree to strongly disagree category system does not work in Spanish or in Semitic languages such as Hebrew.

Semantic Differential Measures. Finally, in 1957, Charles Osgood, George Suci, and Percy Tannenbaum produced a different measurement system that relied not on categories but on a true continuum bounded by bipolar words or phrases, which they labeled the “semantic differential” (Osgood, Suci, and Tannenbaum 1957). Their work built on that of Thurstone and Chave and of Likert and was a carefully conducted measurement program that identified a number of bipolar adjectives demonstrated to be valid and reliable measures of attitude. In particular, they found that a series of items could measure an attitude’s cognitive, affective, and activity (behavioral) dimensions. The format of the measure, however, has limited its use. The measure consists of lines of bipolar terms whose poles have been randomly reversed to require that each be carefully read. For instance, the activity dimension regularly employs three items: active–passive, sharp–dull, and fast–slow. The two terms of each item are separated by an odd number of spaces on the continuum between them, and the respondent reads each item and places a mark on the continuum where her perception of the attitude object falls, for example: active ___ ___ ___ ___ ___ ___ ___ passive.

When used with paper and pencil responses, the semantic differential is a reliable measure of any attitude object, or even of a position stated as a sentence (e.g., “Health care should be universal for all Americans.”). It is important that the visual nature of the measure be maintained (a problem with many web-based survey programs), and it is almost impossible to use in a person-to-person interview or telephone survey.

Establishing Scale Reliability and Validity

As noted earlier in this chapter, measurement reliability and validity are major concerns when constructing a measurement system or instrument. Reliability means that what you are measuring will be measured the same way each time. Validity means that you are actually measuring what you say you are measuring. Interestingly, to establish validity, you must first establish reliability—a measure that is not reliable is never valid (Stacks 2017).

Reliability. A clock that is set five minutes fast with an alarm set for 7:00 in the morning should go off at 7:00 each morning if it is reliable. But the question then becomes, is it actually a valid measure of time? Obviously, the clock is reliable, but is it 7:00 when the alarm goes off or is it actually 6:55, and how much “error” are you willing to accept? Furthermore, are you certain after only one “test” that the alarm will go off on time later? We first looked at this a few pages ago, when we talked about the problem with one-item measures. A response to a single “testing” can never establish reliability—the behavior (marking with a pencil or physically observing a behavior) could be random or could be a clear observation of what happened; the problem is that we do not know, and what we do not know we cannot explain. Hence, we need repeated observations stated in different ways to really ascertain a measure’s reliability.

We also know that, unlike in the hard sciences or in measures that do not involve humans (such as learning psychology’s white mice studies), humans, because of their ability to construct abstractions, are rarely 100 percent reliable over time. Thus, what measurement attempts to do is establish a reliability index that runs from 0.00 (completely unreliable) to 1.00 (completely reliable). This index is expressed in terms of what we know about the measurement—systematic error, or known variance among measurement participants—and what we do not know—random error, or unknown variance among measurement participants—stated, essentially, as the proportion of total variance that is systematic rather than random. Furthermore, if we take the reliability finding, square it, and then subtract it from 1.00, we have an index of what we know and what we do not know about a measurement’s reliability. Thus, if a measure is reported to have a reliability of 0.90, we find that 81 percent of the variance in responding to the measure is known or systematic (“good” error, we can explain it) and 19 percent of the variance is still random error (“bad” error, we cannot explain it). So, even a measure that is 90 percent reliable is almost 20 percent unreliable. Ninety-five percent is a commonly used standard in public relations research. However, Stacks suggests that 90 percent or better reliability is excellent, 80 to 90 percent reliability is good, and anything below 80 percent requires that the measure be approached with caution (Stacks 2017).
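
The chapter does not name a specific reliability coefficient; Cronbach's alpha is one commonly used index for multi-item scales, and the hypothetical sketch below computes it from first principles and then applies the chapter's known-versus-unknown-variance interpretation.

```python
# A hypothetical sketch of a reliability index (Cronbach's alpha), computed
# from first principles for a multi-item scale. Data are illustrative only.
import numpy as np

# Rows are respondents, columns are items of one scale.
items = np.array([
    [5, 4, 5],
    [4, 4, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
])

k = items.shape[1]
item_variances = items.var(axis=0, ddof=1)
total_variance = items.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)
print(round(alpha, 2))

# The chapter's interpretation: square the reliability and subtract from
# 1.00 to see how much variance remains unexplained ("bad" error).
known = alpha ** 2
print(f"known: {known:.0%}, unknown: {1 - known:.0%}")
```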

Validity. Validity—whether we are measuring what we think we are measuring—is more than a philosophical concern. If we think we are measuring reputation but instead are measuring something else, then all the results—the data obtained and the correlations drawn with other, financial indicators—will be worthless. In academia, where time and subjects (students who typically gain course credit for participating in a measure’s creation and validation) are plentiful, a measure is created in four basic steps: the first two address validity, reliability enters at step three, and step four provides indicators of actual usefulness.

The first step in any measurement system is to do the necessary secondary research to understand exactly what it is that is to be measured. This due diligence step produces face validity, or validity based on the measurement researcher’s knowledge of the area and of other extant measures. The measure at this step is only as valid as the amount of time put into secondary research (to include studying measurement itself—psychometrics) and the researcher’s own credibility. Once the researcher has completed a large set of potential items, she turns to the second step for what is called content validity. Step two requires that others who have knowledge of the attitude object examine each item to ensure that the items do indeed relate to what the researcher thinks she is going to measure. Furthermore, because individual researchers often do not see where conflicts or contradictions may occur, this panel of experts can point out where items are poorly stated—they may not relate to the attitude object, or they may be double-barreled, having two or more possible meanings so that responses to the item are never fully understood; double-barreled items usually contain words or phrases joined by “and,” “but,” or “or.” Once the researcher has examined the item pool and the evaluation of the expert panel, she may turn to step three.

Step three examines the measure for its construct validity. Construct validity deals with how respondents actually see the entire measurement scale. People evaluate attitude objects on three dimensions: what they think of them (cognitive); how they react to them (affective); and how they plan to act toward them (conative, sometimes labeled “behavioral”). Step three requires that the items left from step two be randomized and given to a fairly large number of respondents. The measurement scale is then coded into a computer and the results from the scale’s items submitted to statistical testing to ascertain whether the measure “falls out” the way the researcher intended.1 If so, the items that are kept in the scale can be submitted to reliability analysis. If the reliabilities are 0.80 or better (0.70 if the measure will be tested again), the researcher moves to the fourth step, which examines whether the measure actually produces outcomes similar to what should be expected. For instance, if the scale is measuring credibility, how does it correlate with related measures? If an event is being evaluated, do known groups respond as expected—for instance, die-hard Republicans rating Sarah Palin’s credibility as high, while moderate Republicans or Democrats rate it low? Such comparison provides the fourth step of validity—criterion-related validity.
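
A hypothetical sketch of the construct-validity step (see also the chapter's footnote on Factor Analysis) might look like the following; the simulated data and the two-factor structure are assumptions made purely for illustration.

```python
# A hypothetical sketch of checking whether scale items group into the
# intended dimensions via exploratory factor analysis.
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(1)
n = 300
# Simulate two underlying dimensions (e.g., "authoritativeness" and
# "character") with three observed items loading on each.
authority = rng.normal(size=n)
character = rng.normal(size=n)
X = np.column_stack([
    authority + rng.normal(scale=0.5, size=n),
    authority + rng.normal(scale=0.5, size=n),
    authority + rng.normal(scale=0.5, size=n),
    character + rng.normal(scale=0.5, size=n),
    character + rng.normal(scale=0.5, size=n),
    character + rng.normal(scale=0.5, size=n),
])

fa = FactorAnalysis(n_components=2).fit(X)
loadings = fa.components_.T  # rows = items, columns = factors
print(np.round(loadings, 2))
# Items 1-3 should load heavily on one factor and items 4-6 on the other,
# matching the structure the researcher intended ("falling out" as expected).
```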

Reliability and Validity. As noted, there is a relationship between a measure’s reliability and its validity. Obviously, a measure that is not reliable will never be valid; however, a valid measure may not be reliable due to problems with the measure’s wording or other factors that may reduce reliability, such as testing situations, participant language abilities, and so forth.

Extant Measures

How does a public relations professional deal with the 24/7 nature of his job from a measurement perspective, especially with more pressure being placed on him to “demonstrate ROI”? There are hundreds of social variable measures in the academic literature. Indeed, Delbert Miller’s Handbook provides multiple measures used in the social sciences, along with background on their development and reliability and validity information (Miller 2002). In communication, where nonfinancial indicators relating to credibility, relationship, reputation, and trust abound, two books are excellent sources for measurement scales that can be adapted to whatever the current problem is (Rubin, Rubin, and Haridakis 2010; Rubin, Palmgreen, and Sypher 1994). Finally, scholarly journals often publish the scales and items employed in published articles, and if not, the authors can be contacted to provide the scales employed in their studies.

There are two sets of standard measures. First, there are intermediary measures. These measures focus on three specific things that are found in outtakes:

  1. Whether basic facts are actually present in third-party advocacy or in stories and messages published or aired in the social and traditional media.

  2. The presence of misstatements or erroneous information in such messages and stories.

  3. The absence or omission of basic facts that should be included in a complete story.

Additional intermediary measures can include topics drawn from the outtakes, the presence of spokespersons, and the overall sentiment or tone toward the subject of the article or online posting.

These measures typically employ content analysis and the data are nominal—included or not included—or ordinal—positive, neutral, negative—in nature as found in the stories and messages obtained through the social or traditional media.
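
A minimal sketch of tallying such content-analysis codes, with hypothetical story codes, might look like this:

```python
# A hypothetical sketch of tallying content-analysis codes for the
# intermediary measures described above.
import pandas as pd

coded_stories = pd.DataFrame({
    "key_facts_present": ["yes", "yes", "no", "yes"],        # nominal
    "misstatement_present": ["no", "yes", "no", "no"],       # nominal
    "tone": ["positive", "neutral", "negative", "positive"],  # ordinal
})

# Nominal data report as counts or percentages of category.
print(coded_stories["key_facts_present"].value_counts(normalize=True))

# Ordinal tone can be ordered; report counts in their defined order.
tone_order = ["negative", "neutral", "positive"]
coded_stories["tone"] = pd.Categorical(
    coded_stories["tone"], categories=tone_order, ordered=True
)
print(coded_stories["tone"].value_counts().reindex(tone_order))
```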

Second, there are target audience measures, which take the form of outtake and outcome measures. There are six outcome areas: awareness, knowledge, interest, relationship, preference or intent, and advocacy. These measures have become the standard against which other measures can be evaluated (Michaelson and Stacks 2011) and appear in Appendix A of this book. Each measure describes how the data are to be collected and offers prototype questions and response categories.

Case: The Multiplier Studies

As a case in point, and one that demonstrates that combining academic researchers and professional measurement researchers often produces superior results, we will look at three studies that sought to test the long-held assumption that public relations produces X-times more outcome than advertising: a first test of the multiplier effect (Stacks and Michaelson 2004; Michaelson and Stacks 2007; Stacks and Michaelson 2009). This case was chosen because your authors conducted it and can explain why certain measures were created and what the outcomes were. The theoretical rationale and study design are available in Public Relations Journal; here we will focus on the measurement questions and outcomes.

The first study asked a limited number of students to respond to either an advertisement or print editorial copy for one of three products (bottled water, a bandage, and a flu medication) across four media (print editorial, print advertisement, radio advertisement, web page advertisement) and then evaluate that product as to its credibility (believability) and their intent to purchase via a paper-and-pencil self-administered test. These variables were defined initially as traditional business-type forced choice measures, except that a middle point was added (“neither good nor bad”), and respondents who failed to complete an item were coded as RTA. Analyses found no differences across media for any of the three products or against a group who received no stimulus and only the evaluative measures. However, due to the nature of the one-item “measures,” we could not be certain whether the findings were due to there being no multiplier effect or to respondents’ marking behavior being unreliable. Furthermore, the study’s employment of multiple attitude objects or brands may have impacted the outcomes of interest. Discussion of the study at the 2004 Measurement Summit also pointed out problems with self-administered measures, student populations, and a small number of participants.

Therefore, approaching the revised study as if it were a project for a client (and a client actually came forth and funded the follow-up studies but wished to remain anonymous), we rethought the study from both design and measurement perspectives. The new study differed significantly from what we now called the “pilot study.” First, by looking at it as a best practices campaign, we created a new product brand—one that would not have any history or baggage attached to it and one that fit into a series of other brands in similar product lines: Zip Chips, a health snack. However, we wondered, what outcomes would demonstrate a public relations multiplier effect over advertising for a product with no history? A review of the marketing and social science literature focused our attention on several nonfinancial outcome variables as mediating factors that would predict intent to purchase: credibility, homophily (degree of similarity on an attitude or behavior), brand knowledge, and image.

Credibility was further defined as brand authoritativeness and character. Each submeasure was defined by multiple items responded to on a Likert-type strongly agree to strongly disagree continuum. The credibility statements employed for authoritativeness were:

  • The product has been presented honestly.

  • Based on what I know of it, this product is very good.

  • This product is very consumer unfriendly.

  • Based on what I know of it, I find this product quite pleasant to use.

  • This product is awful.

The statements for character were:

  • Based on what I know of it, this product is an excellent choice for me.

  • This product is a value for its price.

  • I think this product is very reliable.

The homophily statements were adapted from a measure developed by McCroskey, Richmond, and Daly (1975) known as the Perceived Homophily Measure. Homophily measures the degree of similarity between people on an attitude, and in our case the measure was adapted to provide measures of attitudinal and behavioral similarity. All items were responded to on a Likert-type strongly agree to strongly disagree continuum. Attitudinal homophily was measured with the following items:

  • This product is something that is like me.

  • People who buy this product are very much like me.

  • I would purchase this product because it reflects my lifestyle.

Behavioral homophily was measured on the following items:

  • This product is used by people in my economic class.

  • This product reflects my social background.

  • People who use this product are culturally similar to me.

In addition, participants were asked to compare their knowledge and awareness of the Zip Chip brand against other chip brands. Finally, they were asked a series of questions assessing their knowledge of the brand, how the brand compared to other brands in the same product class, and their intent to purchase Zip Chips.

Three hundred fifty-one shoppers located in six malls in major cities representative of the 48 contiguous United States were randomly selected to participate. All were first screened for age and newspaper readership (the editorial and advertisement were limited to print media) and were exposed to either the advertisement or the editorial as it would have appeared in The New York Times, or were assigned to a control group who received only the measurement instrument. Instead of reading a testing packet, all participants were interviewed in situ about their attitudes toward Zip Chips by trained interviewers. The data were then coded and statistically analyzed. The first analyses were to establish the psychometric validity of the measures and then their reliabilities. The results found the measurement scales to possess the expected dimensions and to have good or better reliabilities.

The results found no major differences across the study, with the exception of the homophily outcomes, which were significantly higher for those reading the public relations copy than those who saw the advertisement. Furthermore, an analysis of the “don’t know” responses to the outcome measures found that those who were exposed to the public relations copy were less unsure of themselves than those exposed to the advertisement copy.

This study was presented in several venues and published on the Institute for Public Relations website. Discussion focused not on the measures but on the single stimulus presentation. Therefore, a follow-up study using the same outcome measures, but with the advertisement and the print editorial embedded in a full-page New York Times spread, was conducted with 651 participants nationwide at the same six malls. The findings were similar but, of more importance from a measurement perspective, the nonfinancial outcome measures were again found both reliable and valid.

This case demonstrates that public relations outcomes can be measured and that those measuring instruments or scales can be created quite quickly. Furthermore, the basic steps in establishing their validity and reliability can be completed quickly, but only if the measurement researcher has a good understanding of the outcome variables and how they might be adapted to a measurement program within a campaign. The next chapter introduces the concept of secondary research and analysis as a method that helps inform the measurement, assessment, and evaluation process.

Summary

This chapter has taken the reader through the process of creating reliable and valid measures of nonfinancial indicators. It began by setting the stage for public relations measurement that correlates to business outcomes. After a short discussion of the types of variables public relations can measure, it focused on setting up those measures as four different levels of data and on what each level adds to our ability to establish public relations impact on final return on investment. The chapter then turned to different kinds of measures appropriate for public relations nonfinancial measurement and ended with a measurement case.

Understanding both the kind and type of data being used is important when reporting research findings. Each of the measurement metrics discussed (and demonstrated in Appendix A) provides important advantages and disadvantages. These may take into consideration how the scale items are presented, the actual reliability of measures employed, and how validity has been established. When asked, the researcher should be able to respond to any question with a solid interpretation of why the measures were used, how they were created, and how reliability and validity were established.

1 This is done through a statistical test called Factor Analysis. Although well beyond the scope of this volume, Factor Analysis takes the items in a scale and tests to see how they are related to each other. A “factor” or “dimension” emerges from the correlations, which have been “stretched” to maximize the relationships and to ensure that items that appear to be close truly fall within that dimension. There are two types of Factor Analysis: the one used when creating a new measure is called Exploratory Factor Analysis (EFA); the second, which analyzes the factor structure of an extant measure, is called Confirmatory Factor Analysis (CFA). Regardless of whether a measure is being created or an existing measure is used, Factor Analysis should be conducted on participant responses to the measure. EFA and CFA are not reliability analysis, although many people confuse one for the other.
