Chapter 2

Making sense of statistics: Achieving an overall view of your data

What you’ll learn

In this chapter, we give you the basic concepts and vocabulary that you need to understand and talk competently about data. You will learn how to gain an overview of your data by looking at so-called ‘frequency distributions’ – basically an answer to the question: ‘How much of what is there?’. This will help you get a quick but solid understanding of your data and the trends that they may reveal.

Data conversation

‘The results from a recent survey among our ten most important business customers are really worrying me’, said Jim, while nervously playing with his pencil. ‘We asked them to indicate how satisfied they are with our customer relationship management and, on average, they gave us a rating of just 6.8 out of 10 points. This is considerably worse than last year, when we received a rating of 8.5’, he continued.

Jim had joined the customer relationship team only six months ago, so he felt uncomfortable delivering bad news – particularly because the head of customer relationship management attended the meeting.

His team members all looked a bit confused. They thought they had done a good job and they couldn’t understand why they didn’t get a better rating.

No one said anything, until one of the team members, Lisa, raised her hand and asked: ‘Are there any outliers?’ Jim was a bit perplexed and replied that he hadn’t checked for outliers and that he didn’t know why he should. ‘Well, it might help us to better understand this result and maybe it shows that things aren’t as bad as they look’, she explained. After a short discussion, the team agreed that Jim should take a look at the results again and look for outlier cases.

The follow-up meeting on the results of the B2B (business to business) customer survey was scheduled for the next day.

Jim carefully looked at the data again to get an idea about outliers and the measures of central tendency.

At the start of the meeting Jim said ‘I found an interesting pattern’, and passed a sheet with two charts to his colleagues (see below). ‘It seems that the majority of our B2B customers are in fact happy with our customer relationship management. Most B2B customers, except for two, gave us very good ratings’, Jim argued.

His colleagues and his boss were intrigued as Jim could tell by the way they examined the sheet.

‘Oh, I think I know what happened. Remember this very inconvenient misunderstanding we had with two large B2B customers? That was just a few days before we sent out the survey. That probably affected their response’, one of the team members mentioned. Everyone knew about this incident and they were happy that they had been able to straighten things out and clarify the misunderstanding in the meantime. ‘Well done, Jim. We should definitely include this graph in our report to the director’, the head of customer relationship management said. ‘We can write a comment on these two outliers and explain why they probably gave us such a poor rating. We can show that these outliers heavily impact the mean and that the median is more robust and more representative of our data’, she elaborated.

After the meeting, Jim went back to his office. He was happy because not only had he learnt something new about statistics, he had also helped his team to better understand the results of the survey.

Upper graph: mean (dotted line) and median (the other line) when outliers are included. Lower graph: mean (dotted line) and median (the other line) when outliers are excluded.

Intuition and experience are important pillars in professional life. However, with the world becoming ever more complex, volatile and dynamic, relying on intuition and experience alone can be risky and result in bad decisions.

Sometimes our intuition tricks us and sometimes even the most experienced managers must come to realise that things work differently than they thought. And being wrong can come at high costs. Being wrong can mean that sales shrink, customers quit or, in the worst case, that people die. And that’s where data come into play.

Statistics is the key to making your data speak. Statistics allows you to decipher your data and discover the story they tell. But why should you invest your time in engaging with statistics when you have data analysts in your organisation whose job it is to squeeze the answers to your questions out of your data?

The answer is simple: having a basic knowledge of statistics helps you to fully exploit the insights from the data analyses and draw the right conclusions. Perhaps, even more importantly, it helps you understand whether the statistical models that were used really fit the purpose. Otherwise, you may get answers to questions that you never asked.

So let’s get started by understanding the types of data that we need to deal with in business and why Jim should carefully look at outliers before making generalisations.

Statistics is used to analyse all kinds of data that involve numbers. These data are called quantitative data. They are objective and measurable. Quantitative data help to quantify a phenomenon and to answer questions such as ‘what are our best-selling products?’, ‘how much money did people donate to our charity?’ or ‘how many employees have quit their job in our company in the last six months?’.

Quantitative data can be obtained, for instance, through surveys, experiments, or metrics (like website users’ online behaviour or your accounting system).

If your data do not involve numbers, then they are called qualitative data. They are subjective and filled with personal views. Qualitative data give us more detail and context and answer questions such as ‘why do people buy our products?’ or ‘how come we have difficulties recruiting volunteers?’. Qualitative data can be gathered through focus groups, observations or text documents, just to name a few.

When we talk about data analysis in this book, we refer to the analysis of quantitative data.

In the modern working world, numbers in the form of quantitative data have moved to the forefront of decision making. Some might even say that business has developed a bit of an obsession with quantitative data. Everything seems to be counted, measured and put into numbers (whether it makes sense or not). However, before we can start with crunching numbers, we need to know how we can obtain quantitative data. How do we get all the numbers? And what are the different types of data we can collect and use for our decision making?

Quantitative data are measured by variables. You can imagine variables as kitchen vessels and data as the things that are inside. Each vessel contains things of the same kind. For instance, one bowl contains salad and the other apples. Within each bowl, there may be a certain variation. The salad leaves may differ in their size and the apples in their roundness. In other words: variables capture things of the same kind that can change and take different values. People, for example, can vary regarding their age, their consumption preferences, their propensity to volunteer, or their donation or purchasing behaviours. Similarly, organisations can vary by their size, their resources, or their social impact. Likewise, nations can vary in their political system or their official languages. Moreover, things can change over time (e.g., people’s mood, income, or health; organisational growth; a nation’s economic stability).

We can distinguish variables based on how they can be measured. Why is that important? Because it determines what you can do with these variables. Using inappropriate or even wrong statistical methods leads to bad decisions.

Figure 2.1 gives you an overview of the different types of variables and in Figure 2.2 you can find a practical example for each of the variable types.

Variables can be classified as either categorical or continuous (Cramer and Howitt, 2004). A categorical variable, as the name implies, is one that consists of categories. These categories are mutually exclusive so that an object only falls into one of these categories. Categorical variables you might be familiar with are organisation type (for-profit, non-governmental, public) or type of donation (money, services, goods, blood, organs etc.). In its simplest form, a categorical variable is made up of two categories. Such variables are called binary variables. Examples of binary variables are being a client or not; having the lead in a project or not; and being a minor or not.

If a variable contains more than two categories such as nationality (American, British, French, Spanish, German, Swiss etc.) or job location (London, New York, Zurich etc.), then we speak of a nominal variable. The categories of binary and nominal variables are unordered. That means, we consider them as equal. When the categories are ordered, we have an ordinal variable. Examples include educational level (elementary school, high school, higher education) or job title (employee, manager, director, chief operating officer etc.). Also, when a humanitarian organisation asks beneficiaries about their health and they can answer ‘very bad’, ‘somewhat bad’, ‘neutral’, ‘somewhat good’ or ‘very good’, you have an ordinal variable. However, while we can rank these answers, we cannot tell anything about the differences between the values. We cannot say, for instance, that ‘neutral’ is twice as good as ‘somewhat bad’.

If we want to say something about the differences between the values, we need continuous variables. Continuous variables can take the form of interval or ratio variables.

An interval variable is one where the difference between any two values is meaningful. Let’s say an organisation conducts a survey among its employees. The employees would be asked to report their job satisfaction on a seven-point scale ranging from 1 (not satisfied at all) to 7 (very satisfied). The differences between any two values are meaningful because the intervals are equal (e.g., the differences are equal between 1 and 2 or between 5 and 6).

Ratio variables not only require equal distances between any two values but also a ‘natural’ zero point. For instance, the number of strategic projects completed is a ratio variable as it has a meaningful zero point: 0 means that someone has not yet completed a strategic project and 5 means that someone has already completed five strategic projects.

Interval and ratio variables can be discrete or ‘truly continuous’ (Field, 2018). Discrete means that a variable can only take specific values (usually whole numbers). The number of strategic projects completed is an example of a discrete variable – you may have completed one, two, three, six or eight projects but not 3.4 or 5.2 projects. Continuous means that the variable can take an infinite number of values within a range of values. A typical example is the exact weight or height of a person (e.g., someone can be 1.7224353636 m tall).

You now understand that there are different types of variables and that these can be distinguished based on their levels of measurement. To further clarify the different types of variables, Figure 2.2 provides an example.

How to say it

Spell out variable names

Time is precious and to communicate as efficiently as possible we often use abbreviations and acronyms. When talking about variables, it has become commonplace to employ abbreviations (e.g., Att. = attitude; Beh. = behaviours) and acronyms (e.g., ASPM = average sales per month). But caution is warranted when doing so. Such labels may be clear and easy to understand for those who are familiar with a topic or a data set (experts), but not to the rest (novices). Using abbreviations and acronyms for variable names often causes confusion among novice audiences and makes it hard for them to follow the analysis or the conclusions drawn. To avoid imposing unnecessary stress on audiences and scaring them off, spell out variable names whenever possible, i.e., briefly mention what an acronym means and put the explanation on your slides or reports as well. Ask your data scientists to explain the acronyms that they are using, especially if you feel that other audience members are too shy to ask about this.

So now that you have seen the different kinds of data, what can we do with them?

Often, we are interested in the relationship between variables. We might want to know whether people’s income influences the amount of charitable donations that they make or whether monetary incentives increase your employees’ performance. Such relationships can be expressed in terms of two types of variables: predictors and outcomes (Field, 2018).

Taking the example from above, income would be the predictor and the amount of charitable donations would be the outcome. Likewise, incentives would be thought to predict job performance.

Sometimes predictors are also referred to as independent variables and outcomes as dependent variables. Be careful with these terms though. Strictly speaking, you should only speak of independent and dependent variables when the proposed cause was manipulated (that means directly influenced or modified). For instance, you could be interested in whether the type of Facebook posting predicts the number of likes. You may make a posting with an appealing image that shows the impact of your humanitarian work and compare it to a posting that only verbally describes the impact of your humanitarian work. Then, you could measure the number of likes for each posting. Because you actively manipulated – or changed – the type of Facebook posting you can refer to it as a predictor or independent variable and you can refer to the number of likes as an outcome or dependent variable. However, in case you are not sure whether the predictor was manipulated or not, just stay with the predictor/outcome terminology.

How to say it

Call the child by its name

Managers often tell us anecdotes about their colleagues or data analysts using overly complex and technical vocabulary. ‘It is as if they were talking in another language’, a manager reported jokingly in one of our workshops. To keep your audience interested and focused during your data presentations (and to avoid their thoughts wandering off to their next holiday trips, their most urgent to-dos, or their plans for dinner), it is important to use clear, concrete (example-driven) and concise language. It may be hard to follow your analysis when you say something like ‘a predictor had a significant impact on the outcome’. Instead, be specific about what you mean. Tell your audience that the number of food packages that were distributed (predictor) significantly increased the well-being of the beneficiaries (outcome) or that the amount of advertising spending (predictor) positively impacted product sales (outcome).

So far, we have been concerned with the types of variables (i.e., categorical vs continuous) and their function in relationships (predictor vs outcome). As you will see in the following pages, these distinctions are key to statistics literacy. They provide you with a robust foundation to analyse data and choose appropriate and informative measures. Keep in mind that you always gather and analyse data with a specific purpose in mind (except if you belong to the rare group of people who engage with data just for the fun of it). Thus, make sure that the measures you choose are suitable for your goals and help you make good decisions.

Figure 2.3 gives an overview of the basics that you need to know to get an overview and a feel for your data. This is crucial because in a world where everything needs to be done immediately and where we are running from meeting to meeting, time is money. And so is knowledge of how to get a quick but solid understanding of your data and the trends that they may reveal.

Imagine that you just received data on the success of an important strategic project. You open the file, see all the numbers, and feel your heart pound faster and your hands getting sweaty. You burn to learn what the data will tell you, but you don’t know where to start.

An easy and quick way to get an overview of your data and a feel for trends – provided there are any – is to construct frequency distributions (Figure 2.3). Frequency distributions show how often each value occurs in your data set (Field, 2018). Frequency distributions can be presented using different formats: tables and graphs. Let’s assume that your team has collected data about how long volunteers have been working for your organisation (ratio variable: length of volunteering in years) and you now want to see how the values are distributed. You can organise the data into a table. As you can see in the table in Figure 2.4, the data indicate that 29 volunteers have served your organisation for 5 years, whereas only one has been in your organisation for 9 years. Likewise, you can organise the data in a graphical representation (Figure 2.4). This can be easily done with data visualisation tools such as Tableau. When visualised as graphs, frequency distributions can take the form of histograms or bar charts. The y-axis represents the frequency count and the x-axis represents the variable of interest. The graph contains the same information as the table, but in visual form. Histograms are often preferred to tables as they allow for a more comprehensive and richer overview – particularly when you have large data sets and many different values per variable.
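As a quick sketch, a frequency table like the one described above can be built in a few lines of Python. The volunteering numbers below are invented for illustration and are not the chapter’s actual survey data:

```python
from collections import Counter

# Hypothetical data: length of volunteering (in years) for 12 volunteers.
years = [5, 3, 5, 7, 5, 4, 6, 5, 3, 9, 4, 5]

# Count how often each value occurs in the data set.
frequencies = Counter(years)

# Print a simple text-based frequency table, sorted by value.
for value in sorted(frequencies):
    print(f"{value} years: {frequencies[value]} volunteer(s)")
```

Plotting the same counts as a histogram (in a spreadsheet or a visualisation tool such as Tableau) gives the graphical form of the frequency distribution.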

Frequency distributions are a bit like buildings: they take many different shapes and sizes. Therefore, having a vocabulary to describe these types of distributions is crucial for anyone involved with data. Distributions are most commonly described in terms of how much they deviate from a normal distribution (Field, 2018). You may not have heard about normal distributions (also known as Gaussian distributions), but you have surely seen them. A normal distribution forms a bell curve and is symmetrical: if we fold it in the middle, the two sides look completely the same. Normal distributions are too perfect to be true, you may think. But surprise, surprise, they can be found everywhere: lots of natural and man-made phenomena are normally distributed. For instance, height, IQ, or marks on a test are commonly observed to follow a normal distribution. Look at the histogram in Figure 2.4 again: the black line represents the normal curve. As you can see, the length of volunteering in this organisation is nearly normally distributed. The nearer you come to the centre – or the middle – of the distribution, the more values you have. That means that there are many values in the centre and only a few values on the tails.

Distributions can deviate from normal in the following two ways: (1) lack of symmetry (skewness) and (2) pointiness (kurtosis). Skewed distributions are not symmetrical because the most frequently observed values cluster at one end of the distribution and not in the centre of the distribution (Field, 2018). A positively skewed distribution means that the most frequent values are clustered at the lower end of the distribution, whereas a negatively skewed distribution means that the most frequent values are clustered at the upper end of the distribution (Figure 2.5).

Kurtosis tells us something about how pointy a distribution is. It describes the degree to which values cluster at the ends – also referred to as the tails – of a distribution (Field, 2018). Distributions with positive kurtosis have many values in the tails and many values close to the centre. This is what makes them look pointy. The opposite applies to distributions with negative kurtosis. Here you have fewer values in the tails and fewer values close to the centre. The curve looks flat because it has more dispersed values with lighter tails (Figure 2.6).

Skewness and kurtosis statistics can give you helpful insights about your data. Let’s assume you wanted to assess your attractiveness as an employer, using different questions measured on a scale from 1 (to a very little extent) to 7 (to a very great extent) (Sivertzen et al., 2013). If the analysis shows that your employees’ responses about the social value of your organisation (i.e., whether your organisation offers a positive and pleasant social environment) have a heavy positive skew, you should probably be alerted. It means that your employees most frequently rated your organisation as having a low social value. Similarly, the data on the interest value of your organisation (i.e., whether your organisation offers interesting and stimulating jobs) might exhibit a positive kurtosis. This means that you have more extreme values than if the data were normally distributed. Specifically, there is a greater amount of people who either find their jobs very interesting or very uninteresting (i.e., many scores in the tails).
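If you want a rough numerical check for skewness, it can be computed from the deviations around the mean. The sketch below uses a simple moment-based estimate – statistical packages often apply small-sample corrections, so their figures may differ slightly – and the ratings are invented for illustration:

```python
import statistics

def sample_skewness(values):
    # Average cubed deviation, expressed in standard-deviation units:
    # positive -> values cluster at the lower end (positive skew),
    # negative -> values cluster at the upper end (negative skew).
    n = len(values)
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    return sum((x - mean) ** 3 for x in values) / (n * sd ** 3)

# Hypothetical 1-7 ratings clustered at the low end of the scale.
low_ratings = [1, 1, 2, 2, 2, 3, 3, 4, 6, 7]
print(sample_skewness(low_ratings) > 0)  # True: a positive skew
```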

After having an overview of how your values are distributed, you may want to find out where the centre of the distribution lies. Measures of central tendency (also known as central location) are summary statistics that locate the middle point or typical value of a distribution. It is important to know about the measures of central tendency because they help you describe your data with a single value (Porkess and Goldie, 2012). However, failure to choose the right – or most appropriate – measure of central tendency can give you a distorted or misleading impression of your data. The three most commonly used measures are: the mode, the median and the mean (Cramer and Howitt, 2004).

The mode is the value that occurs most often in your data set (Field, 2018). The mode is easy to find in a histogram because it is the tallest bar. The two bar charts in Figure 2.7 show how often the members of a sports association have made a donation in the last ten years. In the chart on the left, the mode is 5 because it is the value that occurs most often. It means that the most frequent number of donations is five. But as illustrated in Figure 2.7, some distributions have more than one mode. A distribution with two modes is called a bimodal distribution and a distribution with more than two modes is known as multimodal distribution. The mode is the only measure of central tendency that can be used with binary and nominal variables; it also works for ordinal and continuous variables, provided that they have discrete values.

The mode, although easy to calculate and comprehend, has several downsides. First, when each value only occurs once, there is no mode. Second, the mode may provide an inaccurate description of your data because it does not consider all values (it only considers the most frequently observed value and ignores all other values). The following two charts (Figure 2.8) illustrate this point. As you can see in the graph on the right, the mode does not accurately locate the central tendency when the most frequently observed value is far away from the rest of the values.
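Python’s standard library can locate the mode directly, including the bimodal case. The donation counts below are invented for illustration:

```python
from statistics import multimode

# Hypothetical donation counts for members of a sports association.
donations = [5, 3, 5, 2, 5, 4, 5, 1, 3, 5]
print(multimode(donations))  # [5]: a single mode

# Two values tie for most frequent: a bimodal distribution.
bimodal = [2, 2, 2, 7, 7, 7, 4, 5]
print(multimode(bimodal))  # [2, 7]

# When every value occurs only once, there is no meaningful mode;
# multimode simply returns all values.
print(multimode([1, 2, 3]))  # [1, 2, 3]
```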

The median is the value that splits the data into two halves. You can find the median by ordering your data from smallest to largest. The median is the value in the middle that has an equal number of values below and above it (Field, 2018). Imagine going around in your organisation and asking 11 team leaders from different divisions about the size of their team. You note the following number of team members for these 11 managers: 3, 15, 7, 2, 10, 5, 6, 8, 11, 13, 4. To calculate the median, you first need to sort these values into ascending order: 2, 3, 4, 5, 6, 7, 8, 10, 11, 13, 15.

In a next step, you count the number of scores (n), add 1 to this value, and then divide it by 2.

Equation: Median odd

(n + 1) / 2 = (11 + 1) / 2 = 12 / 2 = 6

where n is the number of values.

This indicates to us that the sixth value is the middle value of this distribution. We thus know that the median is 7 team members. Calculating the median is straightforward when you have an odd number of values. But what happens when you have an even number of values (Figure 2.9)? What if you asked 12 managers about the size of their team? Let’s say this twelfth team leader has 9 people in her team. This would mean that the median is halfway between the sixth and the seventh value (Equation: Median even). To get the median, we just add the sixth and the seventh value and divide this value by 2. Hence, the median number of team members would be 7.5. Figure 2.9 visually summarises the calculations for the median.

Equation: Median even

(n + 1) / 2 = (12 + 1) / 2 = 13 / 2 = 6.5

where n is the number of values.
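Both cases – an odd and an even number of values – can be checked with the standard library, using the team sizes from the example:

```python
from statistics import median

# Team sizes reported by the 11 team leaders.
teams = [3, 15, 7, 2, 10, 5, 6, 8, 11, 13, 4]
print(median(teams))  # 7: the 6th of the 11 sorted values

# Add the twelfth team leader (9 team members): the median is now
# halfway between the 6th and 7th sorted values, (7 + 8) / 2.
teams.append(9)
print(median(teams))  # 7.5
```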

The median can be used with ordinal, interval and ratio variables. We cannot use it with binary or nominal variables because their categories are considered equal and therefore unordered. The median is similar to the mode in that both ignore most values in the data set. The median only considers the value(s) in the middle of the distribution and does not consider all information available in the data. However, the median is relatively robust and less distorted by skewed data and outliers than the mean (Field, 2018).

Probably the most popular and known measure of central tendency is the mean. The concept behind the mean is quite simple, although the equation may look cumbersome. You basically just add up all the values and then divide the sum by the number of values that you have. The symbol x¯ stands for the mean, ∑ is the Greek letter Sigma and tells you to sum up all values (x) and n represents the number of values you have. In our example with the 12 managers the mean is 7.75. That signifies that, on average, the managers have 7.75 team members.

Equation: Mean

x¯ = (∑ xi) / n

where x¯ represents the mean, ∑ means sum up, n denotes the number of values, xi means the ith value of x.

Equation: Mean (example)

(2 + 3 + 4 + 5 + 6 + 7 + 8 + 9 + 10 + 11 + 13 + 15) / 12 = 93 / 12 = 7.75
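The same arithmetic in Python, using the team sizes of the 12 managers:

```python
from statistics import mean

# Team sizes for the 12 managers.
teams = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 13, 15]

# The mean: add up all values, divide by how many there are.
print(sum(teams) / len(teams))  # 7.75
print(mean(teams))              # 7.75, via the standard library
```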

The mean can only be calculated with interval or ratio variables. Moreover, the mean has two major disadvantages: it can be influenced by skewed data and outliers. As the data become skewed, the mean loses its ability to represent the most typical value in the data set because the skewed values drag it away from the centre (Field, 2018). Look again at Figure 2.5 and try to imagine how skewness affects the mean. Outliers are extreme values in that they are very different from the other values in the data set (Field, 2018). Because the mean considers all values, one or a few extreme values can heavily distort it (see dialogue box). When you have outliers and/or skewed data, it is recommended to use the median instead of the mean (remember: the median is more robust).

We have learnt that the mode, the median and the mean are the most commonly used measures to determine the middle point of a distribution. However, each measure comes with advantages and disadvantages. Moreover, some measures only work for specific variable types. So, the question is what measure of central tendency to use? The decision tree above helps you choose an appropriate measure (Figure 2.10).

How to say it

The thing with the average

When people speak of the ‘average’, they are often referring to the ‘mean’. But as you can see, there are several types of averages, with the most commonly used being the mode, median, and mean. Thus, to avoid misunderstandings, always make explicit what measure(s) of central tendency you are using and why. Also, you may want to make people aware of the limitations of the mean and point them to other factors in a data set, such as the variance. This is further discussed in the next section.

How different is my data? A deviant tale about range

Locating the middle of the distribution is one thing, but the other important point is to identify the spread, or dispersion, of the values (Porkess and Goldie, 2012). The spread tells you how scattered the data are. This is important to know because it gives us an idea of how informative or reliable the measures of central tendency are. Put differently: the spread gives us an idea of how well the measures of central tendency represent the data. If the spread of values is large, the measures of central tendency are less representative of the data than if the spread of values is small. Therefore, we want the spread to be small. Let’s illustrate this with an example. Imagine you were the leader of a sales team and you saw that, on average, your team sold 1,000 cars per year over the last 15 years. The mean value of 1,000 cars, however, is not really informative when the number of cars sold varies greatly between the years – for instance, when your team sold 50 or 60 cars in some years and 5,000 cars in other years.

The range is the easiest measure of dispersion (Figure 2.11). You can calculate the range by subtracting the smallest value in your data from the largest value (Field, 2018). Imagine you obtained information about how many projects your seven top managers have initiated within the last 12 months. If you order these data, you get 1, 2, 4, 5, 7, 9, 11. As you can see, the highest value is 11 and the lowest value is 1. The range is accordingly 11 − 1 = 10. Because the range is based on two extreme observations (i.e., the highest and the lowest value), it is heavily affected by outliers.

A common workaround to this problem is to calculate the interquartile range. That means, you cut off the top and bottom 25% of values and calculate the range between the middle 50% of values (Field, 2018). For our project data, the interquartile range would be 9 − 2 = 7 (see Figure 2.12).
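Both the range and the interquartile range can be computed in a few lines of Python. Note that there are several conventions for locating quartiles; the standard library’s default method happens to reproduce the chapter’s values for this small data set:

```python
from statistics import quantiles

# Projects initiated by the seven top managers.
projects = [1, 2, 4, 5, 7, 9, 11]

# Range: largest value minus smallest value.
print(max(projects) - min(projects))  # 10

# Interquartile range: the spread of the middle 50% of values.
q1, _, q3 = quantiles(projects, n=4)
print(q3 - q1)  # 7.0 (i.e., 9 - 2, as in the chapter)
```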

The interquartile range is not susceptible to outliers, but that comes at a high cost: we lose a lot of data. Just as in real life, there are things of which we might want to get rid: weight, debts and/or bad habits. But commonly, we don’t want to lose things because we need them (or feel that we need them). With data, it’s similar. In most cases, we try to keep them and consider them in our analyses.

If we want to determine the spread of our data including all values, we can look at how far away each value is from the middle of the distribution. This gives us a more encompassing and holistic understanding of our data. If we use the mean as the middle point of the distribution, we can calculate the deviation. This is the difference between each value and the mean (Field, 2018).

Equation: Deviation

Deviation = (xi − x¯)

where xi means the ith value of x and x¯ represents the mean.

If we want to know the total deviation, then we can just add up the deviation for each value.

Equation: Total deviation

Total deviation = ∑ (xi − x¯)

where ∑ represents sum up, n is the number of values, xi means the ith value of x and x¯ represents the mean.

Let’s illustrate this with our strategy project. Look at Figure 2.13. The x-axis represents the seven managers and the y-axis represents the number of projects that they have initiated. The horizontal line stands for the mean and the vertical lines stand for the differences between the mean and each value. Note that the deviations can be positive or negative (depending on whether a given value lies above or below the mean). When we add up all the deviations, however, the total is zero.
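You can verify this cancellation directly – apart from tiny floating-point rounding, the deviations always sum to zero:

```python
from statistics import fmean

# Projects initiated by the seven top managers.
projects = [1, 2, 4, 5, 7, 9, 11]
mean = fmean(projects)

# Deviation: each value minus the mean (positive above, negative below).
deviations = [x - mean for x in projects]

# The positive and negative deviations cancel each other out.
print(abs(sum(deviations)) < 1e-9)  # True
```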

To overcome this problem, we can use the squared deviation. All you have to do is square each deviation (the difference between the mean and each value) and add these squared deviations together. This is what is known as the sum of squared errors (SS). We can use the sum of squared errors as an indicator for the spread of our data. However, there’s a ‘but’ here. The problem is that the total size depends on how many values we have. If we had another 40 values in our project data example, the sum of squared errors would be considerably higher. The implication is that we cannot compare the sum of squared errors across groups that differ in size. What we want is a measure of dispersion that is not dependent on the number of values that we have.

We therefore calculate the average dispersion, which is also referred to as variance. We take the sum of squared errors and divide it by the number of values minus 1 (Field, 2018). But there’s a catch: the variance provides a measure in units squared because we used the sum of squared errors. This makes the variance difficult to interpret. In our example, we would say that the average squared deviation was 13.3 ‘projects initiated squared’. It’s obvious that this interpretation isn’t very informative because it makes little sense to talk about projects initiated squared. We therefore take the square root of the variance. This really important measure is called the standard deviation. It’s generally considered the gold standard to express how scattered the values in a data set are. If you do not want to calculate the standard deviation on your own, you can find helpful tools that will do the work for you.1

Equation: Variance

Variance (s²) = Sum of squared errors (SS) / (n − 1) = ∑ᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1)

where ∑ represents summing up from i = 1 to n, n is the number of values, xᵢ is the ith value of x and x̄ is the mean.

Equation: Standard deviation

s = √[ ∑ᵢ₌₁ⁿ (xᵢ − x̄)² / (n − 1) ]

where √ is the square root, ∑ represents summing up from i = 1 to n, n is the number of values, xᵢ is the ith value of x and x̄ is the mean.

The sum of squares, the variance and the standard deviation are all measures that describe the spread of data around the mean (Field, 2018). The smaller the spread, the closer the data are to the mean (the more similar they are). The bigger the spread, the further away the data are from the mean. Figure 2.14 compares the number of projects initiated by top managers across two non-profit organisations. Both organisations have the same average number of projects initiated (mean of 5.6). However, the number of projects initiated in organisation X spreads much more than in organisation Y. This tells us that the number of projects initiated is more consistent across managers in organisation Y, whereas there are greater differences between managers in organisation X.
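This kind of comparison can be sketched with Python's built-in statistics module. The two small data sets below are hypothetical stand-ins for the organisations in Figure 2.14, constructed so that both share the mean of 5.6:

```python
import statistics

# Hypothetical projects-initiated counts (illustrative, not Figure 2.14's data):
org_x = [1, 2, 5, 9, 11]   # widely spread
org_y = [5, 5, 6, 6, 6]    # tightly clustered

# Both share the same mean...
assert statistics.mean(org_x) == statistics.mean(org_y) == 5.6

# ...but very different standard deviations:
print(round(statistics.stdev(org_x), 2))  # ~4.34
print(round(statistics.stdev(org_y), 2))  # ~0.55
```

Identical means, very different spreads: reporting the mean alone would hide exactly the difference that matters here.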

How to say it

The pitfall of being overly precise: Decimals and large numbers

Accuracy is often seen as an important indicator of data quality. However, being overly precise may produce a boomerang effect. Many people have difficulty perceiving and interpreting numbers with many decimals. For instance, the following numbers are hard to digest: M (Mean) = 3.45678; SD (Standard Deviation) = 1.23987. As a rule of thumb: do not use more than one or two decimals. In specific cases, three decimal places may be adequate to show subtle differences in a distribution. This is much easier to read: M (Mean) = 3.5; SD (Standard Deviation) = 1.2.

Similarly, large numbers are difficult to read, understand and memorise. For example, 5630321 products sold looks like a lot, but it is difficult to grasp how much it actually is. In such cases, it is advisable to use rounded numbers, or to use commas, apostrophes or spaces as thousand separators.

Rounded numbers: 5.6 million products sold

Commas: 5,630,321 products sold

Apostrophes: 5’630’321 products sold

Spaces: 5 630 321 products sold
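All of these formats can be produced directly in Python, for example; the product count is the one from the text, and the M/SD values are the illustrative ones from above:

```python
n = 5630321  # products sold

print(f"{n:,}")                     # commas: 5,630,321
print(f"{n:,}".replace(",", "’"))   # apostrophes: 5’630’321
print(f"{n:,}".replace(",", " "))   # spaces: 5 630 321
print(f"{n / 1e6:.1f} million")     # rounded: 5.6 million

# Trimming decimals works the same way:
print(f"M = {3.45678:.1f}, SD = {1.23987:.1f}")  # M = 3.5, SD = 1.2
```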

Key take-aways

Admittedly, most people don't burst with joy when dealing with data and statistics. However, in this chapter we have learnt that statistics is the key to making your data speak and that it can help you navigate and break down the complexity of your work environment (and hopefully you have also learnt that statistics is not as bad as its reputation suggests and that it can actually be fun).

Whenever you are dealing with data, there are a few questions that are important to ask from a statistical point of view.

  1. What kind of data do you have (quantitative vs qualitative)? And is it appropriate to analyse these data using statistics?
  2. What level of measurement do the variables of interest have (categorical vs continuous)? What types of averages (i.e., measures of central tendency) are hence most appropriate (mode, median, mean)?
  3. Are there outliers in your data? What are possible explanations for these extreme values? Should they be eliminated for some of the statistical analyses to avoid distorted results?
  4. What does the distribution of your variables look like? Does the distribution form a perfect bell curve, or does it deviate from the normal (i.e., skewness and kurtosis)?
  5. How consistent are your data based on their spread?
  6. When presenting data, did you make sure that you are communicating in simple and clear terms (e.g., spelling out variable names; being specific about what you mean; specifying what kind of average you are using and why; not using more than two or three decimals; rounding large numbers or using thousand separators)?

Traps

Analytics traps

  • Failure to determine the right level of measurement of your data (e.g., wrongfully classifying an ordinal variable as an interval variable, although the distances between any two values are not equal).
  • Calculating the median or mean for nominal variables.
  • Calculating the mean for continuous data that are heavily skewed.
  • Calculating the mean for continuous data that have many outliers and not discussing the influence of these outliers.
  • Comparing means without considering the spread of data.

Communication traps

  • Using abbreviations and acronyms (for variable names).
  • Speaking of independent and dependent variables instead of predictors and outcomes when you only observed things but did not manipulate or influence them yourself.
  • Presenting frequency distributions using pie charts instead of histograms or bar charts.
  • Speaking of ‘averages’ without specifying what kind of average you mean (e.g., mode, median or mean).
  • Offering a misleading interpretation of variance by ignoring that it provides a measure in units squared.
  • Overwhelming your audiences with more than two or three decimals and with large numbers that are not rounded or not separated by thousand separators (e.g., commas, apostrophes, spaces).

Further resources

For a short video on how to create a frequency distribution:

https://www.youtube.com/watch?v=amLYLq73RvE

This online calculator calculates the mean, median, mode as well as the range and the interquartile range for you:

https://www.calculatorsoup.com/calculators/statistics/mean-median-mode.php

Note

  1. https://www.calculator.net/standard-deviation-calculator.html and https://www.statisticshowto.com/calculators/variance-and-standard-deviation-calculator/