Chapter 4
IN THIS CHAPTER
Summarizing categorical data effectively
Interpreting commonly used statistics
Realizing what statistics do and don’t say
Every data set has a story, and if statistics are used properly, they do a good job of uncovering and reporting that story. Statistics that are improperly used can tell a different story, or only part of it, so knowing how to make good decisions about the information you’re given is very important.
A descriptive statistic (or statistic for short) is a number that summarizes or describes some characteristic about a set of data. In this chapter and the next one, you see some of the most common descriptive statistics and how they are used, and you find out how to calculate them, interpret them, and put them together to get a good picture of a data set. You also find out what these statistics say and what they don’t say about the data.
Descriptive statistics take a data set and boil it down to a set of basic information. Summarized data are often used to provide people with information that is easy to understand and that helps answer their questions. Picture your boss coming to you and asking, “What’s our client base like these days, and who’s buying our products?” How would you like to answer that question? With a long, detailed, and complicated stream of numbers that are sure to glaze their eyes over? Probably not. You want clean, clear, and concise statistics that sum up the client base for them, so they can see how brilliant you are and then send you off to collect even more data to see how they can include more people in the client base. (That’s what you get for being efficient.)
Summarizing data has other purposes, as well. After all the data have been collected from a survey or some other kind of study, the next step is for the researcher to try to make sense out of the data. Typically, the first step researchers take is to run some basic statistics on the data to get a rough idea about what’s happening in it. Later in the process, researchers can do more analyses to formulate or test claims made about the population the data came from, estimate certain characteristics about the population (like the mean), look for links between variables they measured, and so on.
Another big part of research is reporting the results, not only to the researcher’s peers, but also to the media and the general public. Although a researcher’s peers may be anxiously waiting to hear about all the complex analyses that were done on a data set, the general public is neither ready for nor interested in that. What does the public want? Basic information. Statistics that make a point clearly and concisely are typically used to relay information to the media and to the public.
Categorical data (also known as qualitative data) capture qualities or characteristics about an individual, such as a person’s eye color, gender, political party, or opinion on some issue (using categories such as Agree, Disagree, or No Opinion). Categorical data tend to fall into groups or categories pretty naturally. “Political Party,” for example, typically has four groups in the United States: Democrat, Republican, Independent, and Other. Categorical data often come from survey data, but they can also be collected in experiments. For example, in an experimental test of a new medical treatment, researchers may use three categories to assess the outcome of the experiment: Did the patient get better, worse, or stay the same while undergoing the treatment?
One way to summarize categorical data is to simply count, or tally up, the number of individuals that fall into each category. The number of individuals in any given category is called the frequency for that category. If you list all the possible categories along with the frequency for each of them, you create a frequency table. The total of all the frequencies should equal the size of the sample (because you place each individual in one category).
Q. Suppose you take a sample of ten people and ask them all whether or not they own the latest smartphone. Each person falls into one of two categories: yes or no. The data are shown in the following table:
Person # | Latest Smartphone | Person # | Latest Smartphone |
---|---|---|---|
1 | Y | 6 | Y |
2 | N | 7 | Y |
3 | Y | 8 | Y |
4 | N | 9 | N |
5 | Y | 10 | Y |
A. Data summaries boil down the data quickly and clearly.
Own the Latest Smartphone? | Frequency |
---|---|
Y | 7 |
N | 3 |
Total | 10 |
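The tally above can also be reproduced in a few lines of Python. This is only a minimal sketch: the list `responses` retypes the ten answers from the smartphone table, and the variable names are illustrative rather than anything the chapter prescribes.

```python
from collections import Counter

# The ten Y/N answers from the smartphone table, in person order 1-10.
responses = ["Y", "N", "Y", "N", "Y", "Y", "Y", "Y", "N", "Y"]

# Counter tallies how many individuals fall into each category.
frequency = Counter(responses)

# Print the frequency table; the total should equal the sample size.
for category in ("Y", "N"):
    print(f"{category}: {frequency[category]}")
print(f"Total: {sum(frequency.values())}")
```

Checking that the total matches the sample size is a quick way to confirm that each individual was placed in exactly one category.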
1 You survey 20 shoppers to see what type of soft drink they like best, Brand A or Brand B. The results are: A, A, B, B, B, B, B, B, A, A, A, B, A, A, A, A, B, B, A, A. Which brand do the shoppers prefer? Make a frequency table and explain your answer.
2 A local city government asks voters to vote on a tax levy for the local school district. A total of 18,726 citizens vote on the issue. The Yes count comes in at 10,479, and the rest of the voters say No.
3 A zoo asks 1,000 people whether they’ve been to the zoo in the last year. The pollsters count the results and find that 592 say yes, 198 say no, and 210 don’t respond.
Another way to summarize categorical data is to show the percentage of individuals that fall into each category, thereby creating a relative frequency. The relative frequency of a given category is the frequency (number of individuals in that category) divided by the total sample size, multiplied by 100 to get the percentage; hence, the calculated value is relative to the total surveyed. For example, if you survey 50 people and 10 are in favor of a certain issue, the relative frequency of the “in-favor” category is 10 ÷ 50 = 0.20 times 100, which gives you 20 percent. If you list all the possible categories along with their relative frequencies, you create a relative frequency table. The total of all the relative frequencies should equal 100 percent (subject to possible round-off error).
Q. Using the cellphone data from the following table, make a relative frequency table and interpret the results.
Person # | Latest Smartphone | Person # | Latest Smartphone |
---|---|---|---|
1 | Y | 6 | Y |
2 | N | 7 | Y |
3 | Y | 8 | Y |
4 | N | 9 | N |
5 | Y | 10 | Y |
A. Following is a relative frequency table for the cellphone data. Seventy percent of the people sampled reported owning the latest smartphones, and 30 percent admitted to being technologically just a little behind the times.
Own the Latest Smartphone? | Relative Frequency |
---|---|
Y | 70% |
N | 30% |
You get the 70 percent by taking 7 ÷ 10 = 0.70 times 100, and you calculate the 30 percent by taking 3 ÷ 10 = 0.30 times 100.
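The same relative frequencies can be computed with a short Python sketch. It assumes the ten smartphone answers are retyped into a list named `responses` (an illustrative name), and it applies the definition directly: frequency divided by total sample size, times 100.

```python
from collections import Counter

# The ten Y/N answers from the smartphone table, in person order 1-10.
responses = ["Y", "N", "Y", "N", "Y", "Y", "Y", "Y", "N", "Y"]

counts = Counter(responses)
total = len(responses)

# Relative frequency = (frequency / total sample size) * 100.
relative_frequency = {cat: 100 * n / total for cat, n in counts.items()}

# Report the sample size alongside the percentages.
for cat, pct in relative_frequency.items():
    print(f"{cat}: {pct:.0f}% (n = {total})")
```

Printing the sample size next to each percentage keeps the table interpretable on its own.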
4 You survey 20 shoppers to see what type of soft drink they like best, Brand A or Brand B. The results are: A, A, B, B, B, B, B, B, A, A, A, B, A, A, A, A, B, B, A, A. Which brand do the shoppers prefer?
5 A local city government asked voters in the last election to vote on a tax levy for the local school district. A record 18,726 voted on the issue. The Yes count came in at 10,479, and the rest of the voters checked the No box. Show the results in a relative frequency table.
6 A zoo surveys 1,000 people to find out whether they’ve been to the zoo in the last year. The pollsters count the results and find that 592 say yes, 198 say no, and 210 don’t respond. Make a relative frequency table and use it to find the response rate (percentage of people who respond to the survey).
7 Suppose that instead of showing the number in each group, you show just the percentage (that is, the relative frequency).
You can break down categorical data further by creating two-way tables. Two-way tables (also called crosstabs) are tables with rows and columns. They summarize the information from two categorical variables at once, such as gender and political party, so you can see (or easily calculate) the percentage of individuals in each combination of categories and use them to make comparisons between groups.
For example, if you had data about the gender and political party of your respondents, you would be able to look at the percentage of Republican females, Republican males, Democratic females, Democratic males, and so on. In this example, the total number of possible combinations in your table would be 2 × 4 = 8, or the total number of gender categories times the total number of party affiliation categories. (See Chapter 20 for the full scoop, and then some, on two-way tables.)
The U.S. government calculates and summarizes loads of categorical data using crosstabs. Typical age and gender data, reported by the U.S. Census Bureau for a survey conducted in 2020, are shown in Table 4-1. (Normally age would be considered a numerical variable, but the way the U.S. government reports it, age is broken down into categories, making it a categorical variable.)
Table 4-1 U.S. Population, Broken Down by Age and Gender
Age Group | Total (in Mil.) | % | # Males (in Mil.) | % | # Females (in Mil.) | % |
---|---|---|---|---|---|---|
< 5 | 19.30 | 5.86 | 9.86 | 6.07 | 9.44 | 5.65 |
5–9 | 20.24 | 6.14 | 10.35 | 6.38 | 9.89 | 5.91 |
10–14 | 20.75 | 6.30 | 10.59 | 6.53 | 10.16 | 6.08 |
15–19 | 20.96 | 6.36 | 10.69 | 6.59 | 10.27 | 6.14 |
20–24 | 21.59 | 6.55 | 11.03 | 6.80 | 10.56 | 6.32 |
25–29 | 23.24 | 7.05 | 11.88 | 7.32 | 11.36 | 6.79 |
30–34 | 22.84 | 6.93 | 11.57 | 7.13 | 11.27 | 6.74 |
35–39 | 21.83 | 6.63 | 10.94 | 6.74 | 10.89 | 6.51 |
40–44 | 20.31 | 6.16 | 10.11 | 6.23 | 10.20 | 6.10 |
45–49 | 19.97 | 6.06 | 9.87 | 6.08 | 10.10 | 6.04 |
50–54 | 20.39 | 6.19 | 10.05 | 6.19 | 10.34 | 6.18 |
55–59 | 21.60 | 6.56 | 10.51 | 6.48 | 11.09 | 6.63 |
60–64 | 20.80 | 6.32 | 9.98 | 6.15 | 10.82 | 6.47 |
65–69 | 17.87 | 5.42 | 8.39 | 5.17 | 9.48 | 5.67 |
70–74 | 14.67 | 4.45 | 6.79 | 4.19 | 7.88 | 4.71 |
75–79 | 9.98 | 3.03 | 4.47 | 2.76 | 5.51 | 3.30 |
80–84 | 6.47 | 1.97 | 2.75 | 1.70 | 3.72 | 2.22 |
85+ | 6.65 | 2.02 | 2.41 | 1.49 | 4.24 | 2.54 |
Total | 329.46 | 100 | 162.24 | 100 | 167.22 | 100 |
You can examine many different facets of the U.S. population by looking at and working with different numbers from Table 4-1. For example, looking at gender, you notice that women slightly outnumber men — the population in 2020 was 50.76 percent female (divide the total number of females by the total population size and multiply by 100 percent) and 49.24 percent male (divide the total number of males by the total population size and multiply by 100 percent). You can also look at age: The percentage of the entire population that is under 5 years old was 5.86 percent (divide the total number under age 5 by the total population size and multiply by 100 percent). The largest group belongs to the 25- to 29-year-olds, who made up 7.05 percent of the population.
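The percentages quoted in this paragraph come straight from the Total row and the first row of Table 4-1. The following sketch redoes the arithmetic; the helper name `percent_of` is hypothetical, and the figures are the table values in millions.

```python
# Values copied from Table 4-1 (in millions).
total_pop = 329.46
males = 162.24
females = 167.22
under_5 = 19.30

def percent_of(part, whole):
    """Return part as a percentage of whole."""
    return 100 * part / whole

pct_female = percent_of(females, total_pop)
pct_male = percent_of(males, total_pop)
pct_under_5 = percent_of(under_5, total_pop)

print(f"Female: {pct_female:.2f}%")
print(f"Male: {pct_male:.2f}%")
print(f"Under 5: {pct_under_5:.2f}%")
```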
Next, you can explore a possible relationship between gender and age by comparing various parts of the table. You can compare, for example, the percentage of females to males in the 80-and-over age group. Because these data are reported in 5-year increments, you have to do a little math in order to get your answer, though. The percentage of females who are aged 80 and above (looking at column 7 of Table 4-1) is 2.22 + 2.54 = 4.76 percent. The percentage of males who are aged 80 and over (looking at column 5 of Table 4-1) is 1.70 + 1.49 = 3.19 percent. This shows that the 80-and-over age group for the females is about 49 percent larger than for the males (because (4.76 − 3.19) ÷ 3.19 = 0.49).
These data confirm the widely accepted notion that women tend to live longer than men. However, the gap between men and women is narrowing over time. According to the U.S. Census Bureau, back in 2001 the percentage of women who were 80 years old and over was 4.36, compared to 2.31 for men. The females in this age group outnumbered the males by a whopping 89 percent back in 2001 (noting that (4.36 − 2.31) ÷ 2.31 = 0.89).
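The 80-and-over comparison for 2020 and 2001 can be checked with a small sketch. It assumes the comparison is a relative difference, (female − male) ÷ male, applied to the column percentages from Table 4-1 and to the 2001 figures quoted above; the function name `relative_difference` is illustrative.

```python
# 2020 percentages from Table 4-1: rows 80-84 and 85 and over.
female_80_plus = 2.22 + 2.54   # percent of all females
male_80_plus = 1.70 + 1.49     # percent of all males

def relative_difference(a, b):
    """How much larger a is than b, as a proportion of b."""
    return (a - b) / b

gap_2020 = relative_difference(female_80_plus, male_80_plus)
gap_2001 = relative_difference(4.36, 2.31)  # 2001 figures quoted in the text

print(f"2020: females exceed males by {gap_2020:.0%}")
print(f"2001: females exceed males by {gap_2001:.0%}")
```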
After you have the crosstabs that show the breakdown of two categorical variables, you can conduct hypothesis tests to determine whether a significant relationship or link between the two variables exists, taking into account the fact that data vary from sample to sample. Chapter 15 gives you all the details on hypothesis tests.
Not all summaries of categorical data are fair and accurate. Knowing what to look for can help you keep your eyes open for misleading and incomplete information.
Instructors often ask you to “interpret the results.” Your instructor wants you to use the statistics available to talk about how they relate to the given situation. In other words, what do the results mean to the person who collects the data?
With relative frequency tables, don’t forget to check whether all categories sum to 1, or 100 percent (subject to round-off error), and remember to look for some indicator as to total sample size. See the following for an example of critiquing a data summary.
Q. You watch a commercial where the manufacturer of a new cold medicine (“Nocold”) compares it to the leading brand. The results are shown in the following table.
How Nocold Compares | Percentage |
---|---|
Much better | 47% |
At least as good | 18% |
A. Much like the cold medicines we always take, the table about “Nocold” does “Nogood.”
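The two checks described earlier (do the percentages sum to 100, and is a total sample size reported?) can be applied to the Nocold table mechanically. The sketch below is one possible way to do that; the function name, the warning wording, and the 1-point tolerance for round-off error are all assumptions made for illustration.

```python
def check_relative_frequency_table(percentages, sample_size=None, tolerance=1.0):
    """Return a list of problems found in a relative frequency table.

    tolerance is an assumed allowance (in percentage points) for round-off.
    """
    problems = []
    total = sum(percentages.values())
    if abs(total - 100) > tolerance:
        problems.append(f"percentages sum to {total:g}, not 100")
    if sample_size is None:
        problems.append("total sample size is not reported")
    return problems

# The Nocold table as shown in the commercial.
nocold = {"Much better": 47, "At least as good": 18}
problems = check_relative_frequency_table(nocold)

for problem in problems:
    print("Warning:", problem)
```

Here the two listed categories account for only 65 percent of responses, and the table gives no indication of how many people were asked.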
8 Suppose you ask 1,000 people to identify from a list of five vacation spots which ones they’ve already visited. The frequencies you receive are as follows: Disney World, 216; New Orleans, 312; Las Vegas, 418; New York City, 359; and Washington, D.C., 188.
9 If you have only a frequency table, can you find the corresponding relative frequency table? Conversely, if you have only a relative frequency table, can you find the corresponding frequency table? Explain your answer.
1 Eleven shoppers prefer Brand A, and nine shoppers prefer Brand B. The survey results are shown in the following frequency table. Brand A got more votes, but the results are pretty close.
Brand Preferred | Frequency |
---|---|
A | 11 |
B | 9 |
Total | 20 |
2 Frequencies are fine for summarizing data as long as you keep the total number in perspective.
Vote | Frequency |
---|---|
Y | 10,479 |
N | 8,247 |
Total | 18,726 |
3 This problem shows the importance of reporting not only the results of participants who responded, but also what percentage of the total actually responded.
Gone to the Zoo in the Last Year? | Frequency |
---|---|
Y | 592 |
N | 198 |
Nonrespondents | 210 |
Total | 1,000 |
4 Relative frequencies do just what they say: They help you relate the results to each other (by finding percentages).
Brand Preferred | Relative Frequency |
---|---|
A | 55% |
B | 45% |
5 The results are shown in the following relative frequency table. The Yes percentage is 10,479 ÷ 18,726 × 100 = 55.96 percent. Because the total is 100 percent, the No percentage is 100 − 55.96 = 44.04 percent.
Vote | Relative Frequency |
---|---|
Y | 55.96% |
N | 44.04% |
6 You can see the relative frequency table following this answer. Knowing the response rate is critical for interpreting the results of a survey. (The higher the response rate, the better.) The response rate is 59.2% + 19.8% = 79%, that is, the total percentage of people who responded in any way (yes or no) to the survey. Note that 21 percent is the nonresponse rate.
Gone to the Zoo in the Last Year? | Relative Frequency |
---|---|
Y | 59.2% |
N | 19.8% |
Nonrespondents | 21.0% |
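As a rough check on this answer, the relative frequencies and the response rate can be computed from the zoo survey counts. The dictionary layout and variable names in this sketch are illustrative.

```python
# Counts from the zoo survey of 1,000 people.
counts = {"Y": 592, "N": 198, "Nonrespondents": 210}
total = sum(counts.values())

# Relative frequency of each category, as a percentage of everyone surveyed.
relative = {cat: 100 * n / total for cat, n in counts.items()}

# Response rate: everyone who answered either yes or no.
response_rate = relative["Y"] + relative["N"]

for cat, pct in relative.items():
    print(f"{cat}: {pct:.1f}%")
print(f"Response rate: {response_rate:.1f}% (n = {total})")
```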
7 Showing the percentages rather than counts means making a relative frequency table rather than a frequency table.
When making a relative frequency table, include the total sample size somewhere on the table.
8 Be careful how you interpret tables where an individual can be in more than one category at the same time.
Location | % Who Have Been There | % Who Haven’t Been There |
---|---|---|
Disney World | 21.6% | 78.4% |
New Orleans | 31.2% | 68.8% |
Las Vegas | 41.8% | 58.2% |
New York City | 35.9% | 64.1% |
Washington, D.C. | 18.8% | 81.2% |
Not all tables involving percents should sum to 1. Don’t force tables to sum to 1 when they shouldn’t; do make sure you understand whether each individual can fall under more than one category. In those cases, a typical relative frequency table isn’t appropriate.
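A short sketch makes the point concrete for the vacation survey. Each of the 1,000 respondents could name several destinations, so each destination gets its own yes/no percentage, and the "yes" percentages are not expected to add to 100. The variable names are illustrative.

```python
# Counts from the vacation survey; each respondent may pick several spots.
respondents = 1000
visited = {
    "Disney World": 216,
    "New Orleans": 312,
    "Las Vegas": 418,
    "New York City": 359,
    "Washington, D.C.": 188,
}

# Percentage who have and haven't been to each destination.
pct_been = {place: 100 * n / respondents for place, n in visited.items()}
pct_not = {place: 100 - pct for place, pct in pct_been.items()}

for place in visited:
    print(f"{place}: {pct_been[place]:.1f}% have, {pct_not[place]:.1f}% haven't")

# The counts exceed the number of respondents, so the "have been"
# percentages sum to more than 100.
print(f"Sum of 'have been' percentages: {sum(pct_been.values()):.1f}%")
```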
9 You can always sum all the frequencies to get a total and then find each relative frequency by taking the frequency divided by the total. However, if you have only the percents, you can’t go back and find the original counts unless you know the total number of individuals.
Suppose you know that 80 percent of the people in a survey like ice cream. How many actual people in the survey like ice cream? If the total number of respondents is 100, then 0.80 × 100 = 80 people like ice cream. If the total is 50, then you’re looking at 0.80 × 50 = 40 positive answers. If the total is 5, then you’re dealing only with 4 respondents (0.80 × 5 = 4). This illustrates why relative frequency tables need to have the total sample size somewhere.
Watch for total sample sizes when you’re given a relative frequency table. Don’t be misled by percentages alone, thinking they’re always based on large sample sizes. Many are not.
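The ice cream example can be sketched in code to show how much the underlying count depends on the sample size. The function name `count_from_percent` is hypothetical; the three totals are the ones used in the example.

```python
def count_from_percent(percent, total):
    """Recover a count from a percentage, which requires the sample size."""
    return percent * total / 100

# The same 80 percent corresponds to very different numbers of people.
for total in (100, 50, 5):
    count = count_from_percent(80, total)
    print(f"80% of {total} respondents = {count:.0f} people")
```

Without the `total` argument there is no way to call the function, which mirrors the point that percentages alone cannot be turned back into counts.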
If you’re ready to test your skills a bit more, take the following chapter quiz that incorporates all the chapter topics.
Quiz time! Complete each problem to test your knowledge on the various topics covered in this chapter. You can then find the solutions and explanations in the next section.
1 A researcher orders 10 of the exact same pizzas for delivery from 10 different pizzerias on a given night. They note whether or not each pizza came on time and record the results in the following table.
Pizza # | On Time? | Pizza # | On Time? |
---|---|---|---|
1 | N | 6 | N |
2 | N | 7 | Y |
3 | Y | 8 | Y |
4 | N | 9 | N |
5 | Y | 10 | Y |
2 Bob surveys 10 of his real-estate clients at random to see whether they are thinking of selling in the next year. The results are shown in the following table.
Person # | Selling | Person # | Selling |
---|---|---|---|
1 | N | 6 | N |
2 | N | 7 | N |
3 | Y | 8 | Y |
4 | N | 9 | N |
5 | Y | 10 | Y |
3 Is the following table a frequency table or a relative frequency table? Why?
Voted ( people surveyed) | Results |
---|---|
Y | 32% |
N | 68% |
4 Ten winter days were randomly chosen in the month of January in Columbus, Ohio. Each day, it was noted whether the temperature dipped below 32 or not. On 30% of the days, it did drop below 32, and on 70% of the days, it didn’t.
5 Professor Charleston’s final grades for his class turned out to have 10 As, 8 Bs, 5 Cs, 2 Ds, and 0 Fs.
6 Explain how a pie chart is similar to a relative frequency table.
7 Susan gave a survey to 200 randomly chosen students in her statistics class. She asked the students whether they planned to go away from home for spring break. One hundred students said yes, 60 students said no, and 40 students did not respond.
8 Hasan the quality control manager takes a random sample of 20 packages of frozen shrimp off the shelf. He determines whether each package has expired (yes, no), and checks to see if it has the proper weight (yes, no). He places the results in a table. What type of table will Hasan’s results be summarized in?
9 A survey asks for political party (Republican, Democrat, Independent, Other) and age group (18–30, 31–50, over 50) of 150 randomly chosen registered voters. How many categories (cells) will be in the resulting two-way table?
1 5 Yes; 5 No. Relative frequency table:
Pizzas | On Time? Yes | On Time? No |
---|---|---|
Percent (Relative Frequency) | 50% | 50% |
2
Clients | Thinking of Selling? Yes | Thinking of Selling? No |
---|---|---|
Count (Frequency) | 4 | 6 |
3
4 The table is a relative frequency table because the results are shown in percentage form, and sum to 100 percent.
5
Days | Dropped below 32? Yes | Dropped below 32? No |
---|---|---|
Count (Frequency) | 3 | 7 |
6
Grade | A | B | C | D | F |
---|---|---|---|---|---|
Relative Frequency | 40% | 32% | 20% | 8% | 0% |
7 A pie chart is similar to a relative frequency table because each slice of the pie is a group and each group has a percentage attached to it that indicates the percentage of individuals in that group. The percents sum to 1. A pie chart is a graphical representation of a relative frequency table.
8
Going Away for Spring Break? | Yes | No | No Response |
---|---|---|---|
Percent (Relative Frequency) | 50% | 30% | 20% |
9 A two-way table with rows and columns representing expired (yes, no), and proper weight (yes, no). One possible two-way table is the following:
| | Proper Weight? Yes | Proper Weight? No |
---|---|---|
Expired? Yes | | |
Expired? No | | |
10 The number of cells will be 4 (political parties) * 3 (age groups) = 12 category combinations = 12 cells.
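The cell count can be confirmed by listing every combination of the two variables. This sketch uses `itertools.product` from the Python standard library; the category labels are taken from the quiz question, and the variable names are illustrative.

```python
from itertools import product

# Categories of the two variables from the voter survey.
parties = ["Republican", "Democrat", "Independent", "Other"]
age_groups = ["18-30", "31-50", "over 50"]

# Each (party, age group) pair is one cell of the two-way table.
cells = list(product(parties, age_groups))

print(f"{len(parties)} parties x {len(age_groups)} age groups = {len(cells)} cells")
```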