Chapter 7

Going by the Numbers: Graphing Numerical Data

IN THIS CHAPTER

  • Making and interpreting histograms and boxplots for numerical data
  • Examining time charts for numerical data collected over time
  • Gaining strategies for spotting misleading and incorrect graphs

The main purpose of charts and graphs is to summarize data and display the results to make your point clearly, effectively, and correctly. In this chapter, I present data displays used to summarize numerical data — data that represent counts (such as the number of pills a patient with diabetes takes per day, or the number of accidents at an intersection per year) or measurements (the time it takes you to get to work/school each day, or your blood pressure).

You see examples of how to make, interpret, and evaluate the most common data displays for numerical data: time charts, histograms, and boxplots. I also point out many potential problems that can occur in these graphs, including how people often misread what’s there. This information will help you develop important detective skills for quickly spotting misleading graphs.

Handling Histograms

A histogram provides a snapshot of all the data broken down into numerically ordered groups, making it a quick way to get the big picture of the data, in particular, its general shape. In this section you find out how to make and interpret histograms, and how to critique them for correctness and fairness.

Making a histogram

A histogram is a special graph applied to data broken down into numerically ordered groups, for example, age groups such as 10–20, 21–30, 31–40, and so on. The bars connect to each other in a histogram — as opposed to a bar graph (see Chapter 6) for categorical data, where the bars represent categories that don’t have a particular order and are separated. The height of each bar of a histogram represents either the number of individuals (called the frequency) in each group or the percentage of individuals (the relative frequency) in each group. Each individual in the data set falls into exactly one bar.

Remember You can make a histogram from any numerical data set; however, you can’t determine the actual values of the data set from a histogram because all you know is which group each data value falls into.

An award-winning example

Here’s an example of how to create a histogram for all you movie lovers out there (especially those who love old movies). The Academy Awards were first presented in 1929, and one of the most popular categories for this award is Best Actress in a Motion Picture. Table 7-1 shows the winners of the first eight Best Actress Oscars, the years they won (1929–1936), their ages at the time of winning their awards, and the movies they were in. From the table, you see that the ages cover a range from 22 to 63, much wider than you may have thought it would be.

TABLE 7-1 Ages of Best Actress Oscar Award Winners 1929–1936

Year   Winner              Age   Movie
1929   Janet Gaynor        22    Sunrise
1930   Mary Pickford       37    Coquette
1931   Norma Shearer       28    The Divorcee
1932   Marie Dressler      63    Min and Bill
1933   Helen Hayes         32    The Sin of Madelon Claudet
1934   Katharine Hepburn   26    Morning Glory
1935   Claudette Colbert   31    It Happened One Night
1936   Bette Davis         27    Dangerous

To find out more about the ages of the Best Actress winners, I expanded my data set to the period 1929 to 2021. The age variable for this data set is numerical, so you can graph it using a histogram. From there you can answer questions like: What do the ages of these actresses look like? Are they mostly young, old, or in between? Are their ages all spread out, or are they similar? Are most of them in a certain age range, with a few outliers (either very young or very old actresses, compared to the others)? To investigate these questions, a histogram of ages of the Best Actress winners is shown in Figure 7-1.

A histogram depicts the Best Actress Academy Award winners’ ages, 1929–2021.

FIGURE 7-1: Histogram of Best Actress Academy Award winners’ ages, 1929–2021.

Notice that the age groups are shown on the horizontal (x) axis. They go by groups of five years each: 20–25, 25–30, 30–35, … , 80–85. The percentage (relative frequency) of actresses in each age group appears on the vertical (y) axis. For example, about 27 percent of the actresses were between 25 and 30 years of age when they won their Oscars.

Creating appropriate groups

Tip For Figure 7-1, I used groups of five years each because increments of five create natural breaks for years and because this grouping provides enough bars to reveal general patterns. You don’t have to use this particular grouping, however; you have a bit of creative license when making a histogram. (This freedom also allows others to deceive you, as you see in the later section, “Detecting misleading histograms.”) Here are some tips for setting up your histogram:

  • Each data set requires different ranges for its groupings, but you want to avoid ranges that are too wide or too narrow.
  • If a histogram has really wide ranges for its groups, it places all the data into a very small number of bars that make meaningful comparisons impossible.
  • If the histogram has very narrow ranges for its groups, it looks like a big series of tiny bars that cloud the big picture. This can make the data look very choppy with no real pattern.
  • Make sure your groups have equal widths. If one bar is wider than the others, it may contain more data than it should.

One idea that may be appropriate for your histogram is to take the range of the data (largest minus smallest) and divide by 10 to get the width of each of 10 equal groupings.
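As a rough sketch of that rule of thumb, the Python snippet below works out a group width for the eight ages in Table 7-1. The function name `group_width` and the default of 10 groups are my own choices for illustration, not a fixed rule.

```python
def group_width(values, groups=10):
    """Return the width of each group: the range of the data divided by the number of groups."""
    data_range = max(values) - min(values)  # largest minus smallest
    return data_range / groups

# Ages of the first eight Best Actress winners (Table 7-1)
ages = [22, 37, 28, 63, 32, 26, 31, 27]

width = group_width(ages)
print(f"Range: {max(ages) - min(ages)}, suggested group width: {width}")
```

The suggested width is only a starting point; rounding it to a natural break (such as 5 years for ages) usually gives groups that are easier to read.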

Handling borderline values

In the Academy Award example, what happens if an actress’s age lies right on a borderline? For example, Olivia de Havilland was 30 years old in 1947 when she won the Oscar for To Each His Own. Does she belong in the 25–30 age group (the lower bar) or the 30–35 age group (the upper bar)?

Remember As long as you are consistent with all the data points, you can either put all the borderline points into their respective lower bars or put all of them into their respective upper bars. The important thing is to pick a direction and be consistent. In Figure 7-1, I went with the convention of putting all borderline values into their respective lower bars — which puts Olivia de Havilland’s age in the second bar, the 25–30 age group of Figure 7-1.
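If you group data with software, the same convention can be spelled out explicitly. Here is a minimal sketch, assuming groups of five years that start at age 20 (as in Figure 7-1) and whole-number ages; the helper names `bar_index` and `bar_label` are hypothetical.

```python
import math

def bar_index(value, start=20, width=5):
    """Return the 0-based bar for a value, sending borderline values to the lower bar."""
    if value <= start:
        return 0  # the very first boundary has no lower bar, so it stays in the first bar
    # ceil(...) - 1 puts a value sitting exactly on a boundary into the bar below it
    return math.ceil((value - start) / width) - 1

def bar_label(index, start=20, width=5):
    """Build a label such as '25-30' for a bar."""
    low = start + index * width
    return f"{low}-{low + width}"

for age in (22, 30, 31):
    print(age, "->", bar_label(bar_index(age)))
```

An age of 30 lands in the 25-30 bar, while 31 moves up to the 30-35 bar. Sending borderline values to the upper bar instead would work just as well, as long as every borderline value is treated the same way.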

Clarifying the axes

The most complex part of interpreting a histogram for the reader is to get a handle on what’s being shown on the x and y axes. Having good descriptive labels on the axes helps. Most statistical software packages label the x-axis using the variable name you provide when you enter your data (for example, “age” or “weight”). However, the label for the y-axis isn’t as clear. Statistical software packages often label the y-axis of a histogram by writing “frequency” or “percent” by default. These terms can be confusing: frequency or percentage of what?

Tip Clarify the y-axis label on your histogram by changing “frequency” to “number of” and adding the variable name. To modify a label that simply reads “percent,” clarify by writing “percentage of” and the variable. For example, in the histogram of ages of the Best Actress winners shown in Figure 7-1, I labeled the y-axis “Percentage of actresses in each age group.” In the next section, you see how to interpret the results from a histogram. How old are those actresses anyway?

Example Q. Test scores for a class of 30 students are shown in the following table.

Scores   Frequency
70–79    8
80–89    16
90–99    6

  1. Make a frequency histogram.
  2. Find the relative frequencies for each group.
  3. Without actually drawing it, how would the relative frequency histogram compare to the frequency histogram?

A. Frequency histograms and relative frequency histograms look the same; they’re just done using different scales on the y-axis.

  1. The frequency histogram for the scores data is shown in the following figure.
    A histogram depicts the test scores frequency.
  2. You find the relative frequencies by taking each frequency and dividing by 30 (the total sample size). The relative frequencies for these three groups are $8/30 \approx 0.27$, or 27 percent; $16/30 \approx 0.53$, or 53 percent; and $6/30 = 0.20$, or 20 percent, respectively.
  3. A histogram based on relative frequencies looks the same as the frequency histogram of the same data. The only difference is the scale and label on the y-axis.
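The arithmetic in part 2 can be checked with a few lines of Python. This is just a sketch of the calculation; the dictionary name `frequencies` and the group labels are my own.

```python
# Frequencies from the test-score table (30 students in total)
frequencies = {"70-79": 8, "80-89": 16, "90-99": 6}

total = sum(frequencies.values())

# Relative frequency = frequency / total sample size
relative = {group: count / total for group, count in frequencies.items()}

for group, share in relative.items():
    print(f"{group}: {frequencies[group]:>2} students, {share:.0%}")
```

Because every frequency is divided by the same total, the bars keep the same relative heights, which is why the two histograms have the same shape.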

Your Turn

1 You lose information from the data when you create a histogram. What information is lost?

2 Make a histogram from this data set of test scores: 72, 79, 81, 80, 63, 62, 89, 99, 50, 78, 87, 97, 55, 69, 97, 87, 88, 99, 76, 78, 65, 77, 88, 90, and 81. Would a pie chart be appropriate for this data?

3 Suppose you take a survey of 45 homeowners to find out how many televisions they own. After you finish, you find that 2 people own no TVs, 17 people own one, 22 people own two, 3 own three, and 1 owns four. Make a relative frequency histogram of this data and interpret the results.

4 Suppose you have a loaded die. You roll it several times and record the outcomes, which are shown in the following figure.

A histogram depicts the loaded die data.
  1. Make a relative frequency histogram of these results.
  2. You can make a relative frequency histogram from a frequency histogram; can you go in the other direction?

Interpreting a histogram

A histogram tells you three main features of numerical data:

  • How the data are distributed among the groups (statisticians call this the shape of the data)
  • The amount of variability in the data (statisticians call this the amount of spread in the data)
  • Where the center of the data is (statisticians use different measures)

Checking out the shape of the data

One of the features that a histogram can show you is the shape of the data — in other words, the manner in which the data fall into the groups. For example, all the data may be exactly the same, in which case the histogram is just one tall bar; or the data may have an equal number in each group, in which case the shape is flat.

Some data sets have a distinct shape. Here are three shapes that stand out.

  • Symmetric: A histogram is symmetric if you cut it down the middle and the left-hand and right-hand sides resemble mirror images of each other.

    Figure 7-2a shows a symmetric data set; it represents the amount of time each of 50 survey participants took to fill out a certain survey. You see that the histogram is close to symmetric.

  • Skewed right: A skewed-right histogram looks like a lopsided mound, with a tail going off to the right.

    Figure 7-1 from earlier in the chapter, showing the ages of the Best Actress Award winners, is skewed right. You see on the right side that there are a few actresses whose ages are older than the rest.

  • Skewed left: If a histogram is skewed left, it looks like a lopsided mound with a tail going off to the left.

    Figure 7-2b shows a histogram of 17 exam scores. The shape is skewed left; you see a few students who scored lower than everyone else.

An illustration of a symmetric histogram and a skewed-left histogram.

FIGURE 7-2: Comparing the shape of a) a symmetric histogram and b) a skewed-left histogram.

Following are some particulars about classifying the shape of a data set:

  • Don’t expect symmetric data to have an exact and perfect shape. Data hardly ever fall into perfect patterns, so you have to decide whether the data shape is close enough to be called symmetric.

    If the shape is close enough to symmetric that another person would notice it, and the differences aren’t enough to write home about, I’d classify it as symmetric or roughly symmetric. Otherwise, you classify the data as nonsymmetric. (More sophisticated statistical procedures exist that actually test data for symmetry, but they’re beyond the scope of this book.)

  • Don’t assume that data are skewed if the shape is nonsymmetric. Data sets come in all shapes and sizes, and many of them don’t have a distinct shape at all. I include skewness on the list here because it’s one of the more common nonsymmetric shapes, and it’s one of the shapes included in a standard introductory statistics course.

    If a data set does turn out to be skewed (or close to it), make sure to denote the direction of the skewness (left or right).

As you can see in Figure 7-1, the actresses’ ages in the histogram are skewed right. Most of the actresses were between 20 and 50 years of age when they won, with about 27 percent of them between the ages of 25 and 30. A few actresses were older when they won their Oscars; about 8 percent were between 60 and 65 years of age, and not more than 2 percent (total) were over 70 (if you add the percentages from the last two bars in the histogram). The last few bars on the right side are what give the data a shape that is skewed right.

Measuring center: Mean versus median

A histogram gives you a rough idea of where the “center” of the data lies. The word center is in quotes because many different statistics are used to designate center. The two most common measures of center are the average (the mean) and the median. (For details on measures of center, see Chapter 5.)

Tip To visualize the average age (the mean), picture the data as people sitting on a teeter-totter. Your objective is to balance it. Because data don’t move around, assume the people stay where they are and that you can move the pivot point (which you can also think of as the hinge or fulcrum) anywhere you want. The mean is the place the pivot point has to be in order to balance the weight on each side of the teeter-totter.

The balancing point of the teeter-totter is affected by the weights of the people on each side, not by the number of people on each side. So the mean is affected by the actual values of the data, rather than the amount of data.

The median is the place where you put the pivot point so you have an equal number of people on each side of the teeter-totter, regardless of their weights. With the same number of people on each side, the teeter-totter wouldn’t balance in terms of weight unless the teeter-totter had people with the same total weight on each side. So the median isn’t affected by the values of the data, just their location within the data set.

Remember The mean is affected by outliers, values in the data set that are away from the rest of the data, on the high end and/or the low end. The median, being the middle number, is not affected by outliers.
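To see this with numbers, here is a small sketch using the eight ages from Table 7-1. The second list is hypothetical: I made the oldest winner 30 years older (93 instead of 63) purely to show what happens when an outlier becomes more extreme.

```python
from statistics import mean, median

ages = [22, 37, 28, 63, 32, 26, 31, 27]  # Table 7-1

# Hypothetical change: make the oldest winner 30 years older
stretched = [93 if age == 63 else age for age in ages]

print("original :", mean(ages), median(ages))
print("stretched:", mean(stretched), median(stretched))
```

The mean moves up when the outlier moves, but the median stays put because the middle of the ordered list hasn’t changed.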

Viewing variability: Amount of spread around the mean

You also get a sense of variability in the data by looking at a histogram. For example, if the data are all the same, they are all placed into a single bar, and there is no variability. If an equal amount of data is in each group, the histogram looks flat, with the bars close to the same height; this shows a fair amount of variability.

Warning The idea of a flat histogram indicating some variability may go against your intuition, and if it does, you’re not alone. If you’re thinking a flat histogram means no variability, you’re probably thinking about a time chart, where single numbers are plotted over time (see the section, “Tackling Time Charts,” later in this chapter). Remember, though, that a histogram doesn’t show data over time — it shows all the data at one point in time.

Equally confusing is the idea that a histogram with a big lump in the middle and tails sloping sharply down on each side actually has less variability than a histogram that’s straight across. The curves looking like hills in a histogram represent clumps of data that are close together; a flat histogram shows data equally dispersed, with more variability.

Remember Variability in a histogram is higher when the taller bars are more spread out around the mean and lower when the taller bars are close to the mean.

For the Best Actress Award winners’ ages shown earlier in Figure 7-1, you see many actresses are in the age range from 25 to 30, and most of the ages are between 20 and 50 years, which is quite diverse. Then you have some outliers, those few older actresses (two of them include Jessica Tandy for Driving Miss Daisy and Katharine Hepburn for On Golden Pond); their ages spread the data out farther, increasing its overall variability.

The most common statistic used to measure variability in a data set is the standard deviation, which, in a rough sense, measures the average distance that the data lie from the mean. The standard deviation for the Best Actress Award age data is 12.23 years. (See Chapter 5 for all the details on standard deviation.) A standard deviation of 12.23 years is fairly large in the context of this problem, but the standard deviation is based on average distance from the mean, and the mean is influenced by outliers, so the standard deviation will be as well (see Chapter 5 for more information).
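The effect of an outlier on the standard deviation shows up even in a tiny data set. The sketch below uses only the eight ages from Table 7-1 (not the full 1929–2021 data behind the 12.23 figure) and Python’s sample standard deviation, which divides by n − 1; I’m assuming that matches the formula in Chapter 5.

```python
from statistics import stdev

ages = [22, 37, 28, 63, 32, 26, 31, 27]  # Table 7-1
without_oldest = [age for age in ages if age != 63]

# stdev() computes the sample standard deviation (divides by n - 1)
print("with 63   :", round(stdev(ages), 2))
print("without 63:", round(stdev(without_oldest), 2))
```

Dropping the single 63-year-old cuts the standard deviation by more than half, which is a reminder to check for outliers before reading too much into a large standard deviation.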

In the later section, “Interpreting a boxplot,” I discuss another measure of variability, called the interquartile range (IQR), which is a more appropriate measure of variability when you have skewed data.

Example Q. The police checked the speeds of cars after the city painted lines on a certain section of a street where the road narrows. The speeds are organized in the histogram shown in the following figure. Describe what this histogram tells you about the speeds of the cars.

A histogram depicts the speed.

A. The speeds of the cars in this data set range from 28 miles per hour (lowest) to 41 mph (highest). Most of the cars traveled from 30 to 35 mph (you can tell by noting that those bars have the highest frequencies). A few cars drove faster than the rest (noted by the few short bars at the upper end), which indicates a skewed-right shape. The average speed seems to center around 32 mph (but this is hard to tell without doing more analysis).

Your Turn

5 An ATM asks customers who use the “fast cash” option to choose an amount in $50 increments from $100 to $500. Results from a recent sample of customer withdrawals are shown in the following figure. Discuss the shape, center, and spread of the data.

A histogram depicts the withdrawals.

6 A histogram of a sample of rods made by Rowdy Rod’s is shown in the following figure. The rods should be 100 inches in length. Discuss the company’s accuracy (in terms of meeting the length specification) by interpreting the shape, center, and spread of the data.

A histogram depicts the sample of rods.

7 A histogram of the amount of money 317 households spent on fruits and vegetables in a year is shown in the following figure (based on a random sample). Discuss the shape, center, and spread. What do these three characteristics say about how much families spend on fruits and vegetables? (Such analysis is called interpreting your results in the context of the problem.)

A histogram depicts the fruits and veggies frequency.

8 An investor monitors the percentage return for a particular group of stocks in Portfolio A over a one-year period. The one-year percentage returns for these stocks are shown in the following figure.

A histogram depicts the percent return on stocks.
  1. Some of the values on the x-axis are negative numbers. What does this mean?
  2. On any histogram in general, when (if ever) can the x- and/or y-axis contain negative values?

9 You make two histograms from two different data sets (see the following figures), with each one containing 200 observations. Which of the histograms has a smaller spread, the first or the second?

Two histograms depict the two different data sets.

Putting numbers with pictures

You can’t actually calculate measures of center and variability from the histogram itself because you don’t know the exact data values. To add detail to your findings, you should always calculate the basic statistics of center and variation along with your histogram. (All the descriptive statistics you need, and then some, appear in Chapter 5.)

Figure 7-1 is a histogram for the Best Actress ages; you can see it is skewed right. For Figure 7-3, I calculate some basic (that is, descriptive) statistics from the data set. Examining these numbers, you find the median age is 33.00 years and the mean age is 36.78 years.

An illustration of the descriptive statistics for best actress ages.

FIGURE 7-3: Descriptive statistics for Best Actress ages (1929–2021).

The mean age is higher than the median age because of a few actresses who were quite a bit older than the rest when they won their awards. For example, Jessica Tandy won for her role in Driving Miss Daisy when she was 80, and Katharine Hepburn won the Oscar for On Golden Pond when she was 74. The relationship between the median and mean confirms the skewness (to the right) found in Figure 7-1.

Here are some tips for connecting the shape of the histogram (discussed in the previous section) with the mean and median:

  • If the histogram is skewed right, the mean is greater than the median.

    This is the case because skewed-right data have a few large values that drive the mean upward but do not affect where the exact middle of the data is (that is, the median). Looking at the histogram of ages of the Best Actress Award winners in Figure 7-1, you see they’re skewed right.

  • If the histogram is close to symmetric, then the mean and median are close to each other.

    Close to symmetric means it’s almost the same on either side; it doesn’t need to be exact. Close is defined in the context of the data; for example, the numbers 50 and 55 are said to be close if all the values lie between 0 and 1,000, but they are considered to be farther apart if all the values lie between 49 and 56.

    The histogram shown in Figure 7-2a is close to symmetric. Its mean and median are both equal to 3.5.

  • If the histogram is skewed left, the mean is less than the median.

    This is the case because skewed-left data have a few small values that drive the mean downward but do not affect where the exact middle of the data is (that is, the median).

    Figure 7-2b represents the exam scores of 17 students, and the data are skewed left. I calculated the mean and median of the original data set to be 70.41 and 74.00, respectively. The mean is lower than the median due to a few students who scored quite a bit lower than the others. These findings match the general shape of the histogram shown in Figure 7-2b.

Remember The tips for interpreting histograms found in the previous section can also be used the other way around. If for some reason you don’t have a histogram of the data, and you only have the mean and median to go by, you can compare them to each other to get a rough idea as to the shape of the data set.

  • If the mean is much larger than the median, the data are generally skewed right; a few values are larger than the rest.
  • If the mean is much smaller than the median, the data are generally skewed left; a few smaller values bring the mean down.
  • If the mean and median are close, you know the data are fairly balanced, or symmetric, on each side.
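These three rules of thumb can be written as a short Python function. Treat it as a sketch only: the function name `rough_shape` and the `tolerance` argument (how far apart the mean and median must be before you call the data skewed) are my own assumptions, and the right tolerance depends on the context of the data.

```python
def rough_shape(mean, median, tolerance):
    """Guess the shape of a data set from its mean and median alone."""
    gap = mean - median
    if gap > tolerance:
        return "skewed right"
    if gap < -tolerance:
        return "skewed left"
    return "roughly symmetric"

# Values quoted in this chapter
print(rough_shape(36.78, 33.00, tolerance=1.0))  # Best Actress ages
print(rough_shape(70.41, 74.00, tolerance=1.0))  # exam scores, Figure 7-2b
print(rough_shape(3.5, 3.5, tolerance=1.0))      # survey times, Figure 7-2a
```

A guess like this is no substitute for looking at a histogram; it can’t reveal shapes such as two separate peaks.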

Remember Under certain conditions, you can put together the mean and standard deviation to describe a data set in quite a bit of detail. If the data have a normal distribution (a bell-shaped hill in the middle, sloping down at the same rate on each side; see Chapter 5), the Empirical Rule can be applied.

The Empirical Rule (also in Chapter 5) says that if the data have a normal distribution, about 68 percent of the data lie within 1 standard deviation of the mean, about 95 percent of the data lie within 2 standard deviations of the mean, and about 99.7 percent of the data lie within 3 standard deviations of the mean. These percentages are custom-made for the normal distribution (bell-shaped data) only and can’t be used for data sets of other shapes.
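Here is a minimal sketch of how the Empirical Rule turns a mean and standard deviation into ranges. The mean of 70 and standard deviation of 5 are made-up values for a hypothetical bell-shaped data set, and the function name `empirical_rule` is my own.

```python
def empirical_rule(mean, sd):
    """Return the ranges expected to hold about 68, 95, and 99.7 percent of normal data."""
    shares = {1: "about 68%", 2: "about 95%", 3: "about 99.7%"}
    return {
        share: (mean - k * sd, mean + k * sd)
        for k, share in shares.items()
    }

# Hypothetical bell-shaped data: mean 70, standard deviation 5
for share, (low, high) in empirical_rule(70, 5).items():
    print(f"{share} of the data lie between {low} and {high}")
```

These ranges are meaningful only when the histogram looks bell-shaped; for skewed data such as the Best Actress ages, the percentages don’t apply.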

Example Q. The police checked the speeds of cars after the city painted lines on a certain section of a street where the road narrows. The speeds are organized in the histogram shown in the following figure. Which is greater: the mean of these data or the median?

A histogram depicts the speed frequency.

A. The mean is greater. The data are skewed to the right because there are a few more outlying large values than there are small values. So the mean gets pulled toward the large values to keep that seesaw-type balance.

Your Turn

10 The incomes of last year’s new graduates of a certain large and very successful program are shown in the following figure.

A histogram depicts the income frequency.
  1. Discuss the implications for graduates of this program.
  2. Estimate where the median salary is in this data set.
  3. Do you see any issues that anyone who tries to interpret this data should take into account?

Detecting misleading histograms

There are no hard and fast rules for how to create a histogram; the person making the graph gets to choose the groupings on the x-axis, as well as the scale and starting and ending points on the y-axis. Just because there is an element of choice, however, doesn’t mean every choice is appropriate; in fact, a histogram can be made to be misleading in many ways. In the following sections, you see examples of misleading histograms and how to spot them.

Missing the mark with too few groups

Although the number of groups used for a histogram is up to the discretion of the person making the graph, there is such a thing as going overboard, either by having way too few bars, with everything lumped together, or by having way too many bars, where every little difference is magnified.

Tip To decide how many bars a histogram should have, take a good look at the groupings used to form the bars on the x-axis and see whether they make sense. For example, it doesn’t make sense to talk about exam scores in groups of 2 points; that’s too much detail — too many bars. On the other hand, it doesn’t make sense to group actresses’ ages by intervals of 20 years; that’s not descriptive enough.

Figures 7-4 and 7-5 illustrate this point. Each histogram summarizes $n = 222$ observations of the amount of time between eruptions of the Old Faithful geyser in Yellowstone Park. Figure 7-4 uses six bars that group the data by ten-minute intervals. This histogram shows a general skewed-left pattern, but with 222 observations, you are cramming an awful lot of data into only six groups; for example, the bar for 75–85 minutes has more than 90 pieces of data in it. You can break it down further than that.

A histogram depicts the time between eruptions for Old Faithful geyser.

FIGURE 7-4: Histogram #1 showing time between eruptions for Old Faithful geyser ($n = 222$).

Figure 7-5 is a histogram of the same data set, where the time between eruptions is broken into groups of three minutes each, resulting in 19 bars. Notice the distinct pattern in the data that shows up with this histogram, which wasn’t uncovered in Figure 7-4. You see two distinct peaks in the data: one peak around the 50-minute mark, and one around the 75-minute mark. A data set with two peaks is called bimodal; Figure 7-5 shows a clear example.

Looking at Figure 7-5, you can conclude that the geyser has two categories of eruptions: one group that has a shorter waiting time, and another group that has a longer waiting time. Within each group, you see the data are fairly close to where the peak is located. Looking at Figure 7-4, you couldn’t say that.

A histogram depicts the time between eruptions for Old Faithful geyser.

FIGURE 7-5: Histogram #2 showing time between eruptions for Old Faithful geyser ($n = 222$).

Remember If the interval for the groupings of the numerical variable is really small, you see too many bars in the histogram; the data may be hard to interpret because the heights of the bars look more variable than they should. On the other hand, if the ranges are really large, you see too few bars, and you may miss something interesting in the data.
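The sketch below shows how group width alone can hide or reveal a pattern. The waiting times are made up for illustration (they are not the Old Faithful data), and I chose them so that none sits on a group boundary; the function name `group_counts` is hypothetical.

```python
def group_counts(values, start, end, width):
    """Count how many values fall in each group of the given width."""
    counts = [0] * ((end - start) // width)
    for v in values:
        counts[(v - start) // width] += 1
    return counts

# Made-up waiting times in minutes (NOT the Old Faithful data); none sits on a group boundary
waits = [47, 49, 51, 52, 53, 54, 56, 58, 68, 72, 74, 76, 77, 78, 79, 81, 82]

print("width 20:", group_counts(waits, 45, 85, 20))
print("width  5:", group_counts(waits, 45, 85, 5))
```

With groups 20 minutes wide, the two bars are nearly the same height and the data look flat. With groups 5 minutes wide, two separate peaks appear with a gap between them, which is the same kind of bimodal pattern that the wide groups concealed.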

Watching the scale and start/finish lines

The y-axis of a histogram shows how many individuals are in each group, using counts or percents. A histogram can be misleading if it has a deceptive scale and/or inappropriate starting and ending points on the y-axis.

Remember Watch the scale on the y-axis of a histogram. If it goes by large increments and has an ending point that’s much higher than needed, you see a great deal of white space above the histogram. The heights of the bars are squeezed down, making their differences look more uniform than they should. If the scale goes by small increments and ends at the smallest value possible, the bars become stretched vertically, exaggerating the differences in their heights and suggesting a bigger difference than really exists.

An example comparing scales on the vertical (y) axes is shown in Figures 7-5 and 7-6. I took the Old Faithful data (time between eruptions) and made a histogram whose vertical axis goes by increments of 20 observations, from 0 to 100; see Figure 7-6. Compare this to Figure 7-5, whose vertical axis goes by increments of five observations, from 0 to 35. Figure 7-6 has a lot of white space and gives the appearance that the times are more evenly distributed among the groups than they really are. It also makes the data set look smaller, if you don’t pay attention to what’s on the y-axis. Of the two graphs, Figure 7-5 is more appropriate.

A histogram depicts the Old Faithful geyser eruption times.

FIGURE 7-6: Histogram #3 of Old Faithful geyser eruption times.

Example Q. In an earlier practice question, you see data from a die that’s clearly loaded (not fair). However, someone (the gambler’s lawyer, perhaps?) could make that same die appear fair by setting up the histogram a certain way. Explain how the following histogram (made from that same data) makes the die appear to be fair.

A histogram depicts the loaded die data.

A. This histogram combines the results of rolling a 1 and 2, rolling a 3 and 4, and rolling a 5 and 6, so there are three bars on the histogram now, not six. This histogram is misleading because when you combine the results into three groups of two outcomes each, the differences that make the die loaded don’t show up. The lack of precision works to the advantage of the person who created the graph.
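To see how this kind of grouping can hide a loaded die, consider a sketch with made-up counts from 60 rolls. These counts are my own invention for illustration; they are not read from the figure.

```python
# Made-up counts from 60 rolls of a loaded die (the figure's actual counts are not reproduced here)
rolls = {1: 5, 2: 15, 3: 10, 4: 10, 5: 15, 6: 5}

# Combine outcomes in pairs: 1-2, 3-4, 5-6
paired = {
    f"{low}-{low + 1}": rolls[low] + rolls[low + 1]
    for low in (1, 3, 5)
}

print("six bars  :", list(rolls.values()))
print("three bars:", list(paired.values()))
```

The six separate bars are clearly uneven, yet the three combined bars come out exactly equal, so the grouped histogram looks like a fair die.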

Your Turn

11 Suppose your friend believes their gambling partner plays with a loaded die (not fair). They show you a graph of the outcomes of the games played with this die (see the following figure). Based on this graph, do you agree with your friend? Why or why not?

A histogram depicts the single die outcomes.

12 The first month’s telephone bills for new customers of a certain phone company are shown in the following figure. The histogram showing the bills is misleading, however. Explain why, and suggest a solution.

A histogram depicts the telephone frequency.

Examining Boxplots

A boxplot is a one-dimensional graph of numerical data based on the five-number summary, which includes the minimum value, the 25th percentile (known as $Q_1$), the median, the 75th percentile ($Q_3$), and the maximum value. In essence, these five descriptive statistics divide the data set into four parts; each part contains 25 percent of the data. (See Chapter 5 for a full discussion of the five-number summary.)

Making a boxplot

To make a boxplot, follow these steps:

  1. Find the five-number summary of your data set. (Use the steps outlined in Chapter 5.)
  2. Create a vertical (or horizontal) number line whose scale includes the numbers in the five-number summary and uses appropriate units of equal distance from each other.
  3. Mark the location of each number in the five-number summary just above the number line (for a horizontal boxplot) or just to the right of the number line (for a vertical boxplot).
  4. Draw a box around the marks for the 25th percentile and the 75th percentile.
  5. Draw a line in the box where the median is located.
  6. Determine whether or not outliers are present.

    To make this determination, calculate the IQR (by subtracting Q3 − Q1); then multiply by 1.5. Add this amount to the value of Q3 and subtract this amount from Q1. This gives you a wider boundary around the median than the box does. Any data points that fall outside this boundary are determined to be outliers.

  7. If there are no outliers (according to your results from Step 6), draw lines from the upper and lower edges of the box out to the minimum and maximum values in the data set.
  8. If there are outliers (according to your results from Step 6), indicate their location on the boxplot with * signs. Instead of drawing a line from the edge of the box all the way to the most extreme outlier, stop the line at the last data value that isn’t an outlier.

Tip Many, if not most, software packages indicate outliers in a data set by using an asterisk (*) or star symbol and use the procedure outlined in Step 6 to identify outliers. However, not all packages use these symbols and procedures; check to see what your package does before analyzing your data with a boxplot.
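Steps 6 through 8 can be sketched in a few lines of Python. This is only an illustration, not how any particular software package does it: the function names and the data set are hypothetical, and Q1 and Q3 are assumed to have been found already (Step 1).

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) boundaries from Step 6: 1.5 * IQR beyond each quartile."""
    step = 1.5 * (q3 - q1)
    return q1 - step, q3 + step


def split_outliers(data, q1, q3):
    """Separate outliers from the rest and report where the lines stop (Steps 7 and 8)."""
    lower, upper = outlier_fences(q1, q3)
    outliers = [x for x in data if x < lower or x > upper]
    inside = [x for x in data if lower <= x <= upper]
    # The lines extend to the most extreme values that are not outliers.
    return outliers, (min(inside), max(inside))


# Hypothetical data set; Q1 = 12 and Q3 = 20 are assumed to be known from Step 1.
times = [4, 9, 12, 12, 14, 15, 17, 18, 20, 20, 22, 40]
print(outlier_fences(12, 20))          # (0.0, 32.0)
print(split_outliers(times, 12, 20))   # ([40], (4, 22))
```

Here the value 40 gets an asterisk, and the upper line stops at 22, the largest value inside the boundary.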

A vertical boxplot for ages of the Best Actress Academy Award winners from 1929 to 2021 is shown in Figure 7-7. You can see that the numbers separating sections of the boxplot match the five-number summary statistics, as shown previously in Figure 7-3.

A box plot depicts the best actress ages.

FIGURE 7-7: Boxplot of Best Actress ages (1929–2021; math awards).

Remember Boxplots can be vertical (straight up and down) with the values on the axis going from bottom (lowest) to top (highest); or they can be horizontal, with the values on the axis going from left (lowest) to right (highest). The next section shows you how to interpret a boxplot.

Example Q. Make a vertical boxplot from this data set of exam scores: 43, 54, 56, 61, 62, 66, 68, 69, 69, 70, 71, 72, 77, 78, 79, 85, 87, 88, 89, 93, 95, 96, 98, 99, 99.

A. From an example question in Chapter 5, you found the five-number summary for this data set to be 43, 68, 77, 89, and 99, which completes Step 1. Steps 2 and 3 say to draw a vertical number line (the y-axis) that spans 43 to 99 and mark the five-number summary values. In Step 4 you draw a box with 68 and 89 marking the edges, and in Step 5 you draw a line for the median at 77.

   To start Step 6, you calculate 1.5 × IQR = 1.5 × (89 − 68) = 31.5. Any large outliers would have to be greater than 89 + 31.5 = 120.5, and any small outliers would have to be less than 68 − 31.5 = 36.5. Because none of the exam scores fall outside of those values, there are no outliers. According to Step 7, you draw the remaining lines from the quartiles to the maximum (99) and minimum (43). See the completed boxplot in the following picture.

An illustration of the exam scores.

13 Your turn Suppose you have measured the height in inches of 11 randomly selected adults, and in order from lowest to highest they look like this: 59, 61, 65, 66, 67, 67, 69, 70, 72, 73, 75. Make a horizontal boxplot of these heights.

14 Suppose you have surveyed 14 randomly selected adults for their typical commute times to work, and the responses look like this: 16, 8, 35, 17, 13, 15, 15, 5, 16, 25, 20, 20, 12, 10. Make a vertical boxplot of these times.

Interpreting a boxplot

Similar to a histogram (see the section, “Interpreting a histogram”), a boxplot can give you information regarding the shape, center, and variability of a data set. Boxplots differ from histograms in terms of their strengths and weaknesses, as you see in the upcoming sections, but one of their biggest strengths is how they handle skewed data.

Checking the shape with caution!

A boxplot can show whether a data set is symmetric (roughly the same on each side when cut down the middle) or skewed (lopsided). A symmetric data set shows the median roughly in the middle of the box. Skewed data show a lopsided boxplot, where the median cuts the box into two unequal pieces. If the longer part of the box is to the right of (or above) the median, the data is said to be skewed right. If the longer part is to the left of (or below) the median, the data is skewed left.

As shown in the boxplot of the data in Figure 7-7, the ages are skewed right. The part of the box below the median (representing the younger actresses) is shorter than the part of the box above the median (representing the older actresses). That means the ages of the younger actresses are closer together than the ages of the older actresses. Figure 7-3 shows the descriptive statistics of the data and confirms the right skewness: the median age (33 years) is lower than the mean age (36.78 years).

Remember If one side of the box is longer than the other, it does not mean that side contains more data. In fact, you can’t tell the sample size by looking at a boxplot; it’s based on percentages, not counts. Each section of the boxplot (the minimum to Q1, Q1 to the median, the median to Q3, and Q3 to the maximum) contains 25 percent of the data no matter what. If one of the sections is longer than another, it indicates a wider range in the values of data in that section (meaning the data are more spread out). A smaller section of the boxplot indicates the data are more condensed (closer together).

Warning Although a boxplot can tell you whether a data set is symmetric (when the median is in the center of the box), it can’t tell you the shape of the symmetry the way a histogram can. For example, Figure 7-8 shows histograms from two different data sets, each one containing 18 values that vary from 1 to 6. The histogram on the left has an equal number of values in each group, and the one on the right has two peaks at 2 and 5. Both histograms show the data are symmetric, but their shapes are clearly different.

Figure 7-9 shows the corresponding boxplots for these same two data sets; notice they are exactly the same. This is because the data sets both have the same five-number summaries: they’re both symmetric with the same amount of distance between Q1, the median, and Q3. However, if you just saw the boxplots and not the histograms, you might think the shapes of the two data sets are the same, when indeed they are not.

Despite its weakness in detecting the type of symmetry (you can add a histogram to your analyses to help fill in that gap), a boxplot has a great upside in that you can identify actual measures of spread and center directly from the boxplot, where on a histogram you can’t. A boxplot is also good for comparing data sets by showing them on the same graph, side by side.
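To see how two differently shaped data sets can share one boxplot, here is a small Python sketch. It rests on two assumptions: the two-peaked data set is a hypothetical one I made up to resemble the description of Figure 7-8 (the actual values aren’t listed here), and the quartiles use the "median of each half" convention, which may differ slightly from the steps in Chapter 5 or from your software.

```python
from statistics import median


def five_number_summary(data):
    """Min, Q1, median, Q3, max, using the median-of-halves convention for quartiles."""
    values = sorted(data)
    half = len(values) // 2
    lower, upper = values[:half], values[-half:]  # the middle value is skipped when n is odd
    return (values[0], median(lower), median(values), median(upper), values[-1])


# Both data sets hold 18 values from 1 to 6.
flat = [1, 2, 3, 4, 5, 6] * 3                                          # three of each value
two_peaks = [1] * 2 + [2] * 5 + [3] * 2 + [4] * 2 + [5] * 5 + [6] * 2  # peaks at 2 and 5

print(five_number_summary(flat))        # (1, 2, 3.5, 5, 6)
print(five_number_summary(two_peaks))   # (1, 2, 3.5, 5, 6)
```

The five numbers match, so the boxplots match, even though a histogram of each data set would look quite different.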

A histogram depicts the two symmetric data sets.

FIGURE 7-8: Histograms of two symmetric data sets.

Boxplots depict the two symmetric data sets.

FIGURE 7-9: Boxplots of the two symmetric data sets from Figure 7-8.

Tip All graphs have strengths and weaknesses; it’s always a good idea to show more than one graph of your data for that reason.

Measuring variability with IQR

Variability in a data set that is described by the five-number summary is measured by the interquartile range (IQR). The IQR is equal to Q3 − Q1, the difference between the 75th percentile and the 25th percentile (the distance covering the middle 50 percent of the data). The larger the IQR, the more variable the data set is.

From Figure 7-3, the variability in age of the Best Actress winners as measured by the IQR is Q3 − Q1 = 39 − 28 = 13 years. Of the group of actresses whose ages were closest to the median, half of them were within 13 years of each other when they won their awards.

Tip Notice that the IQR ignores data below the 25th percentile or above the 75th, which may contain outliers that could inflate the measure of variability of the entire data set. So if data are skewed, the IQR is a more appropriate measure of variability than the standard deviation.
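A quick Python sketch shows this resistance to extreme values. The commute times are hypothetical, and the quartiles come from the default convention of Python’s statistics.quantiles, which may not match the Chapter 5 steps exactly; the point is the comparison, not the particular numbers.

```python
from statistics import quantiles, stdev


def iqr(data):
    """Q3 - Q1, using the default quartile convention of statistics.quantiles."""
    q1, _, q3 = quantiles(data, n=4)
    return q3 - q1


typical = [10, 12, 13, 15, 15, 16, 16, 17, 20, 20]   # hypothetical commute times (minutes)
with_extreme = typical[:-1] + [90]                   # same data, largest value replaced by 90

print(iqr(typical), iqr(with_extreme))       # the two IQRs are identical
print(stdev(typical), stdev(with_extreme))   # the standard deviation grows sharply
```

One extreme value leaves the IQR untouched but inflates the standard deviation several times over.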

Picking out the center using the median

The median, part of the five-number summary, is shown by the line that cuts through the box in the boxplot. This makes it very easy to identify. The mean, however, is not part of the boxplot and can’t be determined accurately by just looking at the boxplot.

You don’t see the mean on a boxplot because boxplots are based completely on percentiles. If data are skewed, the median is the most appropriate measure of center. Of course, you can calculate the mean separately and add it to your results; it’s never a bad idea to show both.

Investigating Old Faithful’s boxplot

The relevant descriptive statistics for the Old Faithful geyser data are found in Figure 7-10.

An illustration of the descriptive statistics for Old Faithful data.

FIGURE 7-10: Descriptive statistics for Old Faithful data.

You can predict from the data set that the shape will be skewed left a bit because the mean is lower than the median by about four minutes. The IQR is Q3 − Q1 = 21 minutes, which shows the amount of overall variability in the time between eruptions; 50 percent of the eruptions are within 21 minutes of each other.

A vertical boxplot for length of time between eruptions of the Old Faithful geyser is shown in Figure 7-11. You confirm that the data are skewed left because the lower part of the box (where the small values are) is longer than the upper part of the box.

You see the values of the boxplot in Figure 7-11 that mark the five-number summary and the information shown in Figure 7-10, including the IQR of 21 minutes to measure variability. The center as marked by the median is 75 minutes; this is a better measure of center than the mean (71 minutes), which is driven down a bit by the left-skewed values (the few that are shorter times than the rest of the data).

Looking at the boxplot (Figure 7-11), you see there are no outliers denoted by asterisks. However, note that the boxplot doesn’t pick up on the bimodal shape of the data that you see previously in Figure 7-5. You need a good histogram for that.

Denoting outliers

Looking at the boxplot in Figure 7-7 for the Best Actress ages data, you see a set of what the computer software defines as outliers (ten in all) on the high end of the data set, marked by a group of asterisks (as described in Step 8 in the earlier section, “Making a boxplot”). Four of the asterisks lie on top of one another because four actresses were the same age, 61, when they won their Oscars.

A boxplot of eruption times for Old Faithful geyser.

FIGURE 7-11: Boxplot of eruption times for Old Faithful geyser (n = 222).

The computer uses certain criteria to determine what it decides is an outlier. You can verify these outliers by applying the rule described in Step 6 of the section, “Making a boxplot.” The IQR is 13 (from Figure 7-3), so you take 13 × 1.5 = 19.5 years. Add this amount to Q3 and you get 39 + 19.5 = 58.5 years; subtracting this amount from Q1, you get 28 − 19.5 = 8.5 years. So an actress whose age was below 8.5 years (that is, 8 years old and under) or above 58.5 years (that is, 59 years old or over) is considered to be an outlier.

Of course, the lower end of this boundary (8 years) isn’t relevant because the youngest actress was 21 (Figure 7-3 shows the minimum is 21). So you know there aren’t any outliers on the low end of this data set.

However, ten outliers, as defined by Minitab, are on the high end of the data set, where the 59-and-over actresses’ ages are. Table 7-2 shows the information on all ten outliers in the Best Actress ages data set.

TABLE 7-2 Best Actress Winners with Ages Designated as Outliers

Year  Name               Age  Movie
1968  Katharine Hepburn  60   Guess Who’s Coming to Dinner
1969  Katharine Hepburn  61   The Lion in Winter
1986  Geraldine Page     61   Trip to Bountiful
2007  Helen Mirren       61   The Queen
2018  Frances McDormand  61   Three Billboards Outside Ebbing, Missouri
2012  Meryl Streep       62   The Iron Lady
1932  Marie Dressler     63   Min and Bill
2021  Frances McDormand  64   Nomadland
1982  Katharine Hepburn  74   On Golden Pond
1990  Jessica Tandy      80   Driving Miss Daisy

Making mistakes when interpreting a boxplot

It’s a common mistake to associate the size of the box in a boxplot with the amount of data in the data set. Remember that each of the four sections shown in the boxplot contains an equal percentage (25 percent) of the data; the boxplot just marks off the places in the data set that separate those sections.

Remember In particular, if the median splits the box into two unequal parts, the larger part contains data that’s more variable than the smaller part, in terms of its range of values. However, there is still the same amount of data (25 percent) in the larger part of the box as there is in the smaller part.

Another common error involves sample size. A boxplot is a one-dimensional graph with only one axis representing the variable being measured. There is no second axis that tells you how many data points are in each group. So if you see two boxplots side by side and one of them has a very long box and the other has a very short one, don’t conclude that the longer one has more data in it. The length of the box represents the variability in the data, not the number of data values.

Remember When viewing or making a boxplot, always make sure the sample size (n) is included as part of the title. You can’t figure out the sample size otherwise.

Example Q. Here is a boxplot of the most recent test scores from a class of 36 students.

An illustration of the test scores.
  1. Describe the shape of the distribution. Are there any outliers?
  2. How many students are represented by the box in the middle of the boxplot?

A. Keep in mind that a boxplot is constructed from just the positions of the five values from the five-number summary and not much else. Be careful when making interpretations.

  1. The fact that both the line and the part of the box in the lower half of the boxplot stretch out longer than the same parts in the upper half leads you to believe that the data is also stretched out to smaller values. This means the data is likely skewed left. Also, it seems there is one outlier around the value of 40. (Yikes! Maybe that student wasn’t studying as hard as you are.)
  2. The middle box of any boxplot will contain 50 percent of the data. That’s everybody from Q1 (the 25th percentile) to Q3 (the 75th percentile). But that’s not enough to know how many students are in there. You also need the sample size. Fortunately, it’s given that there are n = 36 students in the class. That means there are 36 × 0.50 = 18 students represented by the middle box in the boxplot.

15 Your turn Which of the following two boxplots contains more data: the one containing the heights of men or the one containing the heights of women?

Boxplots depict the heights of men and women.

Tackling Time Charts

A time chart (also called a line graph) is a data display used to examine trends in data over time (also known as time series data). Time charts show time on the x-axis (for example, by month, year, or day) and the values of the variable being measured on the y-axis (like birth rates, total sales, or population size). Each point on the time chart summarizes all the data collected at that particular time, for example, the average of all pepper prices for January or the total revenue for 2010.

Interpreting time charts

To interpret a time chart, look for patterns and trends as you move across the chart from left to right.

The time chart in Figure 7-12 shows the ages of the Best Actress winners, in order of year won, from 1929 to 2021. Each dot indicates the age of a single actress, the one that won the Oscar that year.

A scatterplot of time chart of age versus year Oscar won.

FIGURE 7-12: Time Chart #1 for ages of Best Actress Academy Award winners, 1929–2021.

Figure 7-12 shows a faint trend in age that is tending uphill, indicating that the Best Actress Award winners may be winning their awards increasingly later in life. Again, I wouldn’t make too many assumptions from this result because the data have a great deal of variability. It’s hard to say what may be going on here; many variables go into determining an Oscar winner, including the type of movie, type of female role, mood of the voters, and so forth, and some of these variables may have a cyclical pattern to them. As far as variability goes, you see that the ages represented by the dots do fluctuate quite a bit on the y-axis (representing age); all the dots basically fall between 20 and 80 years, with most of them between 25 and 45 years, I’d say. This goes along with the descriptive statistics found in Figure 7-3.

Understanding variability: Time charts versus histograms

Remember Variability in a histogram should not be confused with variability in a time chart. If values change over time, they’re shown on a time chart as highs and lows, and many changes from high to low (over time) indicate lots of variability. So a flat line on a time chart indicates no change and no variability in the values across time. For example, if the price of a product stays the same for 12 months in a row, the time chart for price is flat.

But when the heights of a histogram’s bars appear flat, the data are spread out uniformly across all the groups, indicating a great deal of variability in the data. (For an example, refer to Figure 7-2a.)
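The contrast can be checked numerically with a short Python sketch. All three data sets below are hypothetical; the standard deviation stands in as the measure of variability.

```python
from statistics import pstdev

# Time chart view: the same price every month plots as a flat line.
monthly_price = [2.50] * 12

# Histogram view: both data sets hold 18 values between 1 and 6.
flat_histogram = [1, 2, 3, 4, 5, 6] * 3                               # every bar has height 3
mound_histogram = [1] + [2] * 2 + [3] * 6 + [4] * 6 + [5] * 2 + [6]   # bars peak in the middle

print(pstdev(monthly_price))     # 0.0: a flat time chart means no variability
print(pstdev(flat_histogram))    # largest of the three
print(pstdev(mound_histogram))   # smaller than the flat histogram's value
```

A flat line over time gives a standard deviation of zero, while the flat histogram has more spread than the mound-shaped one.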

Spotting misleading time charts

As with any graph, you have to evaluate the units of the numbers being plotted. For example, it’s misleading to chart the number of crimes over time, rather than the crime rate (crimes per capita) — because the population size of a city changes over time, crime rate is the appropriate measure. Make sure you understand what numbers are being graphed, and examine them for fairness and appropriateness.
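Here is a small Python sketch of why the choice matters. The city, its crime counts, and its populations are all hypothetical, and the "per 100,000 residents" scaling is just one common way to express a rate.

```python
def rate_per_100k(count, population):
    """Convert a raw count into a rate per 100,000 residents."""
    return count / population * 100_000


# Hypothetical city: (year, reported crimes, population)
records = [(2000, 4_000, 200_000), (2010, 4_500, 250_000), (2020, 5_000, 312_500)]

for year, crimes, population in records:
    print(year, crimes, round(rate_per_100k(crimes, population)))
# 2000 4000 2000
# 2010 4500 1800
# 2020 5000 1600
```

A time chart of the counts climbs, while a time chart of the rates falls; the same city tells two opposite stories depending on which number is plotted.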

Watching the scale and start/end points

The scale on the vertical axis can make a big difference in the way the time chart looks. Refer to Figure 7-12 to see the original time chart of the ages for the Best Actress Academy Award winners from 1929 to 2021 in increments of ten years. You see a fair amount of variability, as discussed previously.

In Figure 7-12, the starting and ending points on the vertical axis are 0 and 100, which creates some extra white space on the top and bottom of the picture. I could have used 10 and 90 as my start/end points, but this graph looks reasonable.

Now, what happens if I change the vertical axis? Figure 7-13 shows the same data, with start/end points of 20 and 80. The increments of 10 years appear longer than the increments of 10 years shown in Figure 7-12. Both of these changes in the graph exaggerate the differences in ages even more.
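You can put a rough number on that stretching effect. The Python sketch below computes how much of the vertical axis the data cover under each choice of start/end points; the function name is hypothetical, and the youngest and oldest ages (21 and 80) are the ones cited in this chapter.

```python
def share_of_axis(low, high, axis_min, axis_max):
    """Fraction of the vertical axis covered by the span from low to high."""
    return (high - low) / (axis_max - axis_min)


youngest, oldest = 21, 80   # minimum and maximum ages cited in this chapter

print(share_of_axis(youngest, oldest, 0, 100))   # 0.59
print(share_of_axis(youngest, oldest, 20, 80))   # about 0.98
print(share_of_axis(0, 10, 0, 100), share_of_axis(0, 10, 20, 80))   # one ten-year step
```

With 0 and 100 as the end points, the ages fill a bit more than half of the picture; with 20 and 80, they fill nearly all of it, and each ten-year step takes up about a sixth of the axis instead of a tenth.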

Remember How do you decide which graph is the best one for your data? There is no perfect graph, no right or wrong answer; but there are limits. You can quickly spot problems just by zooming in on the scale and the start/end points.

Simplifying excess data

A time chart of the intervals between eruptions for the Old Faithful data is shown in Figure 7-14. You see 222 dots on this graph; each one represents the time between one eruption and the next, for every eruption during a 16-day period.

A scatterplot of time chart of age versus year Oscar won.

FIGURE 7-13: Time Chart #2 for ages of Best Actress Oscar Award winners, 1929–2021.

A scatterplot depicts the intervals between eruptions for Old Faithful geyser.

FIGURE 7-14: Time chart showing intervals between eruptions for Old Faithful geyser (n = 222 consecutive observations).

This figure looks very complex; data are everywhere, there are too many points to really see anything, and you can’t find the forest for the trees. There is such a thing as having too much data, especially nowadays when you can measure data continuously and meticulously using all kinds of advanced technology. I’m betting they didn’t have a student standing by the geyser recording eruption times on a clipboard, for example!

To get a clearer picture of the Old Faithful data, I combined all the observations from a single day and found its mean; I did this for all 16 days, and then I plotted all the means on a time chart in order. This reduced the data from 222 points to 16 points. The time chart is shown in Figure 7-15.
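The same kind of condensing can be sketched in Python. The readings below are hypothetical, not the actual Old Faithful data, and the function name is my own.

```python
from collections import defaultdict
from statistics import mean


def daily_means(observations):
    """Average the values recorded on each day; returns (day, mean) pairs in day order."""
    by_day = defaultdict(list)
    for day, minutes in observations:
        by_day[day].append(minutes)
    return [(day, mean(values)) for day, values in sorted(by_day.items())]


# Hypothetical (day, minutes between eruptions) readings.
readings = [(1, 80), (1, 55), (1, 75), (2, 90), (2, 60), (3, 70), (3, 50), (3, 81)]
print(daily_means(readings))   # day 1 averages 70, day 2 averages 75, day 3 averages 67
```

Eight points become three, one per day, which is the same idea as going from 222 points to 16.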

A scatterplot depicts the daily average intervals between eruptions for Old Faithful geyser.

FIGURE 7-15: Time chart showing daily average intervals between eruptions for Old Faithful geyser (n = 16 consecutive days).

From this time chart, I see a little bit of a cyclical pattern to the data; every day or two, it appears to shift from short times between eruptions to longer times between eruptions. While these changes are not definitive, they do provide important information for scientists to follow up on when studying the behavior of geysers like Old Faithful.

Remember A time chart condenses all the data for one unit of time into a single point. By contrast, a histogram displays the entire sample of data that was collected at that one unit of time. For example, Figure 7-15 shows the daily average time between eruptions for 16 days. For any given day, you can make a histogram of all the eruptions observed on that particular day. Displaying a time chart of average times over 16 days accompanied by a histogram summarizing all the eruptions for a particular day would be a great one-two punch.

Example Q. The following figure shows the revenues of a company taken over time. Each dot represents the revenue for that year, in millions of dollars.

A line graph depicts the revenues curve.
  1. What’s the time period over which this data set was collected?
  2. Describe what the line graph tells you about the revenues of this company over the time period.
  3. What do you need to take into account in order to properly interpret revenues (or any variable reported in dollars) over time?

A. Be aware of the impact of inflation over time on data reported in dollars. Some line graphs adjust for inflation and some don’t. Look at the fine print.

  1. The time period is approximately 1970 to 2000.
  2. Revenues increased a little in the 1970s and then began a more steady increase in the ’80s until around 1989, when the company broke the $50 million barrier. The company experienced a big jump around 1990–1991, with a very strong and steady increase each year since. In 2000, the company’s revenue was up to $225 million and rising.
  3. You should take inflation of the dollar over time into account. The revenues may look larger later on, but the value of the dollar has decreased over time as well. Some line graphs adjust for inflation.
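One common way to adjust for inflation is to restate every amount in the dollars of a single base year using a price index. The Python sketch below assumes that approach; the revenues and the index values are hypothetical and are not read from the figure.

```python
def to_constant_dollars(amount, index_then, index_base):
    """Restate a dollar amount in base-year dollars using a price index."""
    return amount * index_base / index_then


# Hypothetical revenues (millions of dollars) and hypothetical price-index values.
revenue = {1980: 20.0, 1990: 60.0, 2000: 225.0}
price_index = {1980: 50.0, 1990: 75.0, 2000: 100.0}

for year in revenue:
    adjusted = to_constant_dollars(revenue[year], price_index[year], price_index[2000])
    print(year, revenue[year], adjusted)
# 1980 20.0 40.0
# 1990 60.0 80.0
# 2000 225.0 225.0
```

In these made-up numbers, the unadjusted revenue grows more than elevenfold from 1980 to 2000, but in year-2000 dollars the growth is less than sixfold.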

16 Your turn Check out the sales of a particular car across the United States over a 60-day period in the following figure.

A line graph of sales.
  1. Can you see a pattern to the sales of this car across this time period?
  2. What are the highest and lowest numbers of sales and when did they occur?
  3. Can you estimate the average of all sales over this time period?

17 After a heart attack, Bob decides that it is a good time to get in shape, so he starts exercising each day and plans to increase his exercise time as he goes along. Look at the two line graphs shown in the following figures. One is a good representation of his data, and the other should get as much use as Bob’s treadmill before his heart attack.

Line graphs of exercise logs.
  1. Compare the two graphs. Do they represent the same data set, or do they show totally different data sets?
  2. Assume both graphs are made from the same data. Which graph is more appropriate and why?

18 The line graph in the following figure shows one company’s revenues over time. Explain why this graph is misleading and what you can do to fix the problem.

A line graph of revenue.

19 Line graphs typically connect the dots that represent the data values over time. If the time increments between the dots are large, explain why the line graph can be somewhat misleading.

Practice Questions Answers and Explanations

1 You don’t know the actual values of the data anymore; you know only what group they fall into. For example, if eight test scores fall in the group 70–79, they could all be 70, they could all be 79, or they could be some mixture in between.

2 One possible histogram of this data is shown in the following figure. Yours may have different groupings and look slightly different. The data is quantitative, so a pie chart isn’t appropriate. The groupings I chose for this histogram are as follows: 48–52; 53–57; 58–62; 63–67; 68–72; 73–77; 78–82; 83–87; 88–92; 93–97; and 98 and up.

A histogram depicts the frequency.

Remember The scale of the x-axis is continuous on a histogram, so if you have a place for a bar but you have no data, you should leave a space in that spot. If you don’t leave a space, you’ll have an incorrect histogram, because the gaps help you to distinguish how spread out the data is throughout the bars.

3 The following figure shows the relative frequency histogram for this data. Yours should look similar. The relative frequencies are 4.4 percent, 38 percent, 49 percent, 7 percent, and 2 percent.

A histogram depicts the number of TV's owned.

4 You can find the total sample size from a frequency histogram.

  1. The relative frequency histogram is shown in the following figure. Note, the total sample size is missing here, but you can still find it by summing the heights of all the bars. Here, the total frequency is 70. To get the relative frequencies, divide the height of each bar by 70 (the total sample size) to obtain a percentage. For example, the percentage of ones is the number of ones divided by 70, written as a percent; that is the height of the first bar.
    A histogram depicts the loaded die data.

    Remember When making your own bar charts and histograms, include the total sample size if you want to score points with your professor.

  2. No. If you receive only the percent in each group, without the total sample size, you can’t determine the original number in each group.
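Both parts of this answer can be sketched in Python. The counts below are hypothetical (chosen only so they total 70); the actual bar heights are in the figure.

```python
def relative_frequencies(counts):
    """Turn bar heights (counts) into percentages of the total sample size."""
    total = sum(counts.values())
    return {outcome: 100 * count / total for outcome, count in counts.items()}


# Hypothetical counts for 70 rolls of a die.
rolls = {1: 7, 2: 14, 3: 7, 4: 14, 5: 14, 6: 14}
print(sum(rolls.values()))           # 70
print(relative_frequencies(rolls))   # {1: 10.0, 2: 20.0, 3: 10.0, 4: 20.0, 5: 20.0, 6: 20.0}

# Doubling every count leaves the percentages unchanged, which is why
# percentages alone cannot reveal the sample size.
doubled = {outcome: 2 * count for outcome, count in rolls.items()}
print(relative_frequencies(doubled) == relative_frequencies(rolls))   # True
```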

5 The shape of this data set is skewed right. The median is somewhere between $150 and $200, and the mean is slightly larger because of the single $300, $400, and $500 withdrawals. A crude measure of spread without these three single values is less than $50 (taking the range without those values divided by six); with these three values, a crude measure of spread is around $80 (the range of all the values divided by six). You can see a more accurate measure of spread in Chapter 5.

6 The histogram of rod lengths is skewed left. It appears to be centered very near 100, but the mean drops down a bit because of the one value lower than the rest (99.2). The range of the lengths is very tight, between 100.6 and 99.2 inches (each rod is within 1.4 inches). Most rod lengths are between 99.8 and 100.3 inches, indicating that the company seems to be doing very well in terms of accuracy. They may want to check into the four values of 100.6 and the one value of 99.2 to see whether they can improve the process somewhere.

7 The histogram is symmetric and mound shaped. You can see one peak around $500. The range is quite large, all the way from $100 to $800, due in part to the size of the household and also to dietary habits and the availability and price of fruits and vegetables. Most households spent between $200 and $700 on these items per year.

Remember “Interpret the results” means you need to do two things. First, identify the statistics and what they mean numerically, and second (and most important), apply those statistics to the scenario at hand. Discuss what the results mean in the context of the problem.

8 Percentage return means the difference between the ending value minus the beginning value, divided by the beginning value; it can be a positive or negative number or zero for no change.

  1. The x-axis represents the recorded variable, which is the one-year return for each stock. A negative number means that the stock lost money during that year (the price at the beginning was more than its value at the end).
  2. The x-axis will contain negative values if some of the data are negative. However, the y-axis of a histogram reports the counts or percentage of data in each grouping, and those values are always greater than or equal to zero.
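The definition of percentage return translates directly into a one-line calculation. In this Python sketch, the function name and the share prices are hypothetical.

```python
def percentage_return(begin_value, end_value):
    """Change in value over the period, as a percentage of the beginning value."""
    return 100 * (end_value - begin_value) / begin_value


# Hypothetical share prices at the start and end of a year.
print(percentage_return(50.0, 60.0))   # 20.0
print(percentage_return(50.0, 40.0))   # -20.0
print(percentage_return(50.0, 50.0))   # 0.0
```

The second stock would land on the negative side of the x-axis, but it would still add one to the height of its bar.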

9 The first figure has a smaller spread because the bars are higher in the middle, near the mean. Therefore, on average, many of the values in this data set are close to the mean. The second figure shows all the data spread out about equally, which indicates that some are close to the mean, but an equal percentage are a medium distance away, and another equal percentage are a long distance away. This results in a larger spread.

Warning Don’t interpret a flat histogram as meaning the data shows “no change” or no spread. A flat histogram means the data does have quite a bit of spread. Data with a bell shape that’s tight around the middle has a much smaller spread, because you generally measure spread as average distance from the middle.

Remember Note that the values on the x-axis are the same for both graphs. If the values aren’t the same, you’ll have difficulty comparing the spreads alone; you should also take the scale into account. This may be above and beyond your particular course, but if you want to compare spreads with data sets that have different scales and different means, you can use the coefficient of variation, which is the standard deviation divided by the mean. A large coefficient of variation means the spread is large, relative to the mean. A small coefficient of variation means the spread is small, relative to the mean.
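As a rough illustration of the coefficient of variation, here is a Python sketch with two hypothetical data sets measured on very different scales.

```python
from statistics import mean, stdev


def coefficient_of_variation(data):
    """Standard deviation divided by the mean (meaningful for positive-valued data)."""
    return stdev(data) / mean(data)


# Hypothetical data sets on very different scales.
weights_kg = [60, 65, 70, 75, 80]
salaries = [60_000, 65_000, 70_000, 75_000, 80_000]

print(stdev(weights_kg), stdev(salaries))   # the raw spreads differ by a factor of 1,000
print(coefficient_of_variation(weights_kg), coefficient_of_variation(salaries))   # nearly identical
```

The standard deviations look wildly different, yet relative to their means the two data sets are equally spread out.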

10 Be aware that people may be tempted to give an incorrect response to a salary question (in other words, lie). Such a fib is called response bias, and it results in data that’s systematically over or under the truth (in this case, probably over).

  1. The graph shows that this group is making plenty of money, although the graph is skewed to the right, with fewer graduates making the large amounts. The range is quite large, going from $50,000 to $100,000, and the center is probably in the high-$60,000 range.
  2. Because the heights of the bars sum to 13, you know 13 salaries make up the data set. The median is the one in the middle, which is the seventh salary. Because the first bar contains five salaries and the second bar contains three, the seventh number is in the second bar, which is in the $60,000 range.
  3. You have only 13 salaries — from a “large” program — so the data probably isn’t precise, because it represents such a small part of the overall group. Also, you need to keep in mind that these values will change over time.
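The counting argument in part 2 can be sketched in Python by keeping a running total of the bar heights. The first two heights (5 and 3) come from the answer above; the last three are hypothetical values chosen only so the total is 13.

```python
def bar_containing_median(counts):
    """Index of the histogram bar that holds the middle value (odd sample size assumed)."""
    middle_position = (sum(counts) + 1) // 2
    running_total = 0
    for index, count in enumerate(counts):
        running_total += count
        if running_total >= middle_position:
            return index


# Bar heights from the lowest salary group upward; the last three are hypothetical.
heights = [5, 3, 2, 2, 1]
print(sum(heights), bar_containing_median(heights))   # 13 1  (index 1 is the second bar)
```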

11 No. The die isn’t loaded; the graph is loaded. The histogram is misleading because it starts at 40 on the y-axis and goes to only 65. The differences in the heights of the bars are exaggerated because the graph doesn’t start at zero. A more fair and balanced-looking graph of the same data is shown here. Notice the frequencies for the outcomes are fairly close, indicating no evidence of the die being loaded.

A histogram depicts the single die outcomes.

Warning Beware of graphs that don’t start at zero on the y-axis. They may make the results look more dramatic than necessary, which is misleading to the reader.
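To see how much a shifted starting point can exaggerate things, compare how tall two bars look relative to each other. In this Python sketch, the function name and the two frequencies are hypothetical.

```python
def drawn_height_ratio(taller, shorter, axis_start=0):
    """How many times taller one bar looks than another, given where the y-axis starts."""
    return (taller - axis_start) / (shorter - axis_start)


# Hypothetical frequencies for two die outcomes.
print(drawn_height_ratio(60, 45))                  # about 1.33 with the axis starting at zero
print(drawn_height_ratio(60, 45, axis_start=40))   # 4.0 with the axis starting at 40
```

A difference of one-third in the counts looks like a fourfold difference once the axis starts at 40.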

12 This histogram has a great deal of open space at the top, and the bars appear to be very short and close together in height. The scale for the y-axis of this graph uses oversized increments. A better histogram would have smaller increments on the y-axis, and it wouldn’t include numbers that go beyond what you need to show the data. Such a histogram follows.

A histogram depicts the telephone bill.

13 First, you use the tricks and tips in Chapter 5 to find the five-number summary, which is 59, 65, 67, 72, and 75. Steps 2 and 3 say to draw a horizontal number line (the x-axis) that spans 59 to 75, and mark the five-number summary values. In Step 4 you draw a box with 65 and 72 marking the edges, and in Step 5 you draw a line for the median at 67.

To start Step 6, you calculate 1.5 × IQR = 1.5 × (72 − 65) = 10.5. Any large outliers would have to be greater than 72 + 10.5 = 82.5, and any small outliers would have to be less than 65 − 10.5 = 54.5. Because none of the heights fall outside of those values, there are no outliers. According to Step 7, you draw the remaining lines from the quartiles to the maximum (75) and minimum (59). See the completed boxplot in the following figure.

A boxplot depicts heights of randomly selected adults.

14 Remember to first order the commute times and then find the five-number summary, which is 5, 12, 15.5, 20, and 35. Next, draw a vertical number line (the y-axis) that spans 5 to 35, and mark the five-number summary values. Then draw a box with 12 and 20 marking the boundaries of the box, and draw a line for the median at 15.5.

To check for outliers, you start by calculating $\text{IQR} = Q_3 - Q_1 = 20 - 12 = 8$. Any large outliers would have to be greater than $Q_3 + 1.5 \times \text{IQR} = 20 + 12 = 32$, and any small outliers would have to be less than $Q_1 - 1.5 \times \text{IQR} = 12 - 12 = 0$. There are no commute times less than 0, but there is one greater than 32: the maximum value, 35. You label the value of 35 with an asterisk in the boxplot and draw the line from the box to the largest value less than the boundary value of 32, namely 25. On the low end, you draw the remaining line from the edge of the box down to the minimum of 5. See the completed boxplot in the following figure.

A boxplot depicts commute time to work.
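The outlier check used in problems 13 and 14 can be sketched in a few lines of Python. I'm assuming the 1.5 × IQR rule here, which matches the boundary values of 0 and 32 quoted for the commute times. The function name `outlier_fences` is my own, and the quartiles are taken from the five-number summaries given in the solutions.

```python
def outlier_fences(q1, q3):
    """Return the (lower, upper) outlier boundaries from the 1.5 x IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Problem 13 (heights): Q1 = 65, Q3 = 72
heights_fences = outlier_fences(65, 72)
# Problem 14 (commute times): Q1 = 12, Q3 = 20
commute_fences = outlier_fences(12, 20)

print(heights_fences)          # (54.5, 82.5)
print(commute_fences)          # (0.0, 32.0)
print(35 > commute_fences[1])  # True: the maximum commute time is an outlier
```

The heights run from 59 to 75, inside their boundaries, so that boxplot has no asterisks. The commute maximum of 35 is above 32, so it gets one.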

15 Without the sample size, there’s no way to know for sure. The heights of the men are more spread out than the heights of the women, but that only tells you about the variation. You can see that the range and IQR of the men’s heights are both clearly greater. But for all you know, the boxplot for the men’s data is based on 20 males while the boxplot for the women’s data is based on 2,000 females. That’s why it’s so important that plots also include the sample size — so you know how much information your conclusions are based on.

16 Interpreting a line graph is different from interpreting a histogram; a line graph represents many snapshots of the situation over time, each one summarized by one point, and a histogram shows a single detailed snapshot of all the data at once.

  1. No. The data seem to fluctuate back and forth from around 350 to 800 cars sold per day.
  2. You can’t tell exactly, but the highest sales figure is around 800, occurring on days 1 and 21, and the lowest, at 350, occurred only a few days before, around day 17. Maybe the customers knew a sale was coming and waited to buy.
  3. The average appears to be around 600 cars sold per day, looking at what the values on the y-axis seem to center around. (The actual average is 613.)

17 The way data are organized can greatly affect how readers interpret graphs. Look at the way the data are organized before you think about what the graph means.

  1. They do represent the same data set. The difference is in the scale used on the y-axis. The second graph has larger increments on the y-axis, so the differences in the data over time are played down.
  2. The first graph is more appropriate. It has a scale that uses most of the space on the graph, and it doesn’t understate or overstate Bob’s progress over time.

18 This line graph is misleading because the time increments on the x-axis (time) aren’t equally spaced, but the graph presents them as if they are, which incorrectly makes it look like the revenue is increasing at the same rate over time. To fix the problem, you need to space the time periods shown on the x-axis properly. The correct line graph is shown in the following figure.

A line graph of revenue.

Remember When you see a line graph, look at the time increments on the x-axis and make sure the times are equally spaced in terms of the number of years between them.
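If you have the x-axis values in hand, a quick way to check the spacing is to look at the gaps between consecutive time points. This is a minimal sketch; the years below are hypothetical (they are not the values from problem 18), and the function names are my own.

```python
def gaps(times):
    """Differences between consecutive time points."""
    return [later - earlier for earlier, later in zip(times, times[1:])]

def evenly_spaced(times):
    """True when every consecutive gap is the same size."""
    return len(set(gaps(times))) <= 1

# Hypothetical x-axis labels, for illustration only.
years = [1990, 1995, 2000, 2002, 2003]

print(gaps(years))           # [5, 5, 2, 1]
print(evenly_spaced(years))  # False
```

When the gaps differ like this, the points should be placed on the x-axis according to their actual years rather than at equal distances.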

19 If the time increments between the dots are large, and you connect the dots, you assume that the change that took place during the interim period (when data wasn’t collected) occurred at a steady rate, represented by the line that connects the dots. That may not always be the case.

If you’re ready to test your skills a bit more, take the following chapter quiz that incorporates all the chapter topics.

Whaddya Know? Chapter 7 Quiz

Quiz time! Complete each problem to test your knowledge on the various topics covered in this chapter. You can then find the solutions and explanations in the next section.

1 You have a data set of exam-taking times, where 60 minutes is the maximum time. Most students finished in 30 to 40 minutes, but a few of them took longer, and a couple of them turned in their exams at the very end. Is the histogram of exam times symmetric, skewed right, or skewed left?

2 If the mean is much larger than the median, are the data symmetric?

3 Suppose a data set contains the values 10, 11, 12, 13, 12, 10, 11, 12, 10, 16. Is 16 considered an outlier according to the criterion described in this chapter?

4 A larger boxplot means a larger data set. True or false?

5 Twenty-five percent of the data lies within each section of a boxplot. True or false?

6 Changing the scale on a graph can change the way it looks. True or false?

7 You can get the size of the data set from its boxplot. True or false?

8 A flat histogram shows no variability. True or false?

9 Which data set has a higher coefficient of variation: data set 1 (Mean = 10, StDev = 2) or data set 2 (Mean = 2, StDev = 10)?

10 Which statistics make up the five-number summary?

Answers to Chapter 7 Quiz

1 The answer is skewed right because the times tend to be on the left side of the distribution and then trail off to the right, with fewer and fewer of them getting larger and larger.

2 No. The data are not symmetric if the mean is much larger than the median. The mean is influenced by skewness and outliers, and the median is not, so some outliers and/or skewness are pushing the mean past the median in this case.

3 Yes. The criterion is that if the data value lies above $Q_3 + 1.5 \times \text{IQR}$, then it’s an outlier on the right side. In this case, $Q_3$ is 12 because it’s the median of the upper half of the ordered data. The value of $Q_1$ is 10 because it’s the median of the lower half of the ordered data. The IQR is $Q_3 - Q_1 = 12 - 10 = 2$, and $1.5 \times \text{IQR} = 3$. So values above $12 + 3 = 15$ are considered outliers by this criterion. This means 16 is considered an outlier in this case.
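You can verify this answer with a short Python sketch. It assumes the criterion is the 1.5 × IQR rule and finds the quartiles as the medians of the lower and upper halves of the ordered data, as the answer describes. The function name `quartiles_by_halves` is my own, and leaving the middle value out of both halves when the count is odd is an assumption (it doesn't come into play here, because this data set has 10 values).

```python
from statistics import median

def quartiles_by_halves(values):
    """Return (Q1, Q3) as the medians of the lower and upper halves.

    Needs at least two values. With an odd count, the middle value is
    left out of both halves (an assumed convention).
    """
    data = sorted(values)
    half = len(data) // 2
    lower = data[:half]
    upper = data[len(data) - half:]
    return median(lower), median(upper)

values = [10, 11, 12, 13, 12, 10, 11, 12, 10, 16]
q1, q3 = quartiles_by_halves(values)
iqr = q3 - q1
upper_fence = q3 + 1.5 * iqr

print(q1, q3, iqr)       # 10 12 2
print(upper_fence)       # 15.0
print(16 > upper_fence)  # True, so 16 is an outlier
```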

4 False. A larger boxplot means you have more variability in the data between $Q_1$ and $Q_3$. The data are more spread out. It does not indicate anything about the size of the data set.

5 True. Twenty-five percent of the data lies between the min and $Q_1$, $Q_1$ and the median, the median and $Q_3$, and $Q_3$ and the max.

6 True. If you make the scale include large increments, the graph appears to have smaller changes from group to group. If you make the scale include small increments, the graph appears to have larger changes from group to group.

7 False. You cannot get the size of the data set from its boxplot. All you know is that 25 percent of the data lies within each section; you don’t know any more than that.

8 False. A flat histogram indicates a lot of variability, because you measure variability by distance from the mean. A flat histogram has as much data far from the mean as near it, so the data are spread out rather than clustered around the center.

9 Data set 2 has a higher coefficient of variation. The coefficient of variation is the standard deviation divided by the mean. For data set 1, the coefficient of variation is $2/10 = 0.2$. For data set 2, the coefficient of variation is $10/2 = 5$.
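The comparison takes only a couple of lines in Python. This sketch uses the definition given in the answer (standard deviation divided by the mean); the function name is my own.

```python
def coefficient_of_variation(mean, stdev):
    """Standard deviation divided by the mean."""
    return stdev / mean

cv1 = coefficient_of_variation(mean=10, stdev=2)
cv2 = coefficient_of_variation(mean=2, stdev=10)

print(cv1)        # 0.2
print(cv2)        # 5.0
print(cv2 > cv1)  # True: data set 2 varies more relative to its mean
```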

10 The five-number summary consists of the minimum value, $Q_1$, the median, $Q_3$, and the maximum value. (Note that the mean is not part of the five-number summary.)
