© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
R. Okunev, Analytics for Retail, https://doi.org/10.1007/978-1-4842-7830-7_1

1. The Basics of Statistics

Rhoda Okunev
Tamarac, FL, USA

Clear and concise business questions need to be formulated before developing a hypothesis and conducting statistical analysis for an online promotion or any other investigation. The variables that are collected should be organized to answer those business questions, and the statistical techniques should be chosen beforehand so that the data needed to answer the questions is at hand. Data is just a collection of numbers until it is cleaned, organized, and arranged. This step is essential and often overlooked. An accurate analysis requires understanding the basic descriptive statistics: what the middle of the data should look like, where the endpoints or tails of the data are, and how large the error in the data is. Once the data is rigorously reviewed, it can start to tell a clear and consistent story to a targeted audience. That story starts with a clear and understandable business question. With the appropriate statistical analysis, its significant results, and charts that convey the main takeaways simply and clearly, the pertinent findings come to life. These findings, which grow out of the business question, help a company realize its direction, purpose, and potential.

Descriptive Statistics

The term statistics refers to mathematical procedures that are used to collect, organize, clean, and summarize data. This is important in retail because, between inventory, customer, and sales data, your datasets will be large, and the sheer volume of information can be overwhelming. Cleaning and summarizing your data is key to understanding and interpreting it accurately. Once that is done, the statistics are used to create your business narrative and help optimize your ongoing strategy.

The first concept to understand in statistics is how to differentiate between a population and a sample. A population is the entire group that is being investigated. Studying an entire population is usually costly, and recruiting an entire population is difficult or impossible. Therefore, most investigations and surveys use a random sample, which is a smaller group drawn from the population. When the sample is chosen at random, it is expected to be representative of the population that is to be studied.

For example, imagine a company with 100 employees in the research department and 1,000 employees in the company as a whole. The study’s purpose is to assess the quantitative skills of the employees in the research department. In this case, the relevant population is the research department (100 employees), not the whole company (1,000 employees). So, a sample of, say, 30 employees or more should all come from the research department. If 30 employees from across the entire company are chosen by the administrator to participate, this group is not representative of the research department because it includes other departments as well. A company hires individuals from various backgrounds with diversified skills, and the quantitative skills for a sales manager or administrative assistant are not expected to be the same as for a researcher.

To create a random sample of the group of interest, the researcher may take 30 employees from the 100 in the research department. But those employees cannot be handpicked, and you cannot take the first 30 employees, because that would not be random. Even when humans attempt to choose items “randomly,” the result is not truly random. Instead, a possible alternative is to generate a table of random numbers in Excel and use an algorithm, as demonstrated soon in the book or developed with a statistician, to choose the sample. This way the sample selected would truly be random.
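The idea of letting a random number generator pick the 30 employees can be sketched as follows. This is only an illustration in Python rather than the Excel approach the book uses; the employee IDs 1 to 100 and the seed value are assumptions made for the example.

```python
import random

# Hypothetical roster: employee IDs 1..100 stand in for the research department.
research_department = list(range(1, 101))

# A fixed seed (an arbitrary, illustrative choice) makes the draw reproducible.
rng = random.Random(42)

# Draw 30 distinct employees without replacement; every employee
# has the same chance of being chosen.
sample = rng.sample(research_department, k=30)

print(sorted(sample))
```

Because the generator, not a person, decides who is included, no employee is handpicked and the first 30 names on the list get no special treatment.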

Retailers often use sample sets when analyzing order data for trends by category, when analyzing CRM populations for various studies or surveys, and when analyzing user behavior or visits to websites. For example, if you were conducting a survey of your CRM database, you could randomly generate a stratified sample based on customers who fit certain segments, such as having completed a purchase or having visited the website within a certain period of time. Generating random samples gives you a representative estimate of what is happening in the larger population as a whole. Here are some more statistical differences that should be kept in mind between a sample and a population.
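As a rough sketch of the stratified idea, the Python example below groups a made-up CRM extract by segment and then draws the same fraction at random from each segment. The segment names, the customer IDs, the 10 percent fraction, and the helper name `stratified_sample` are all assumptions for illustration; a real stratified design may need a statistician's input.

```python
import random
from collections import defaultdict


def stratified_sample(customers, fraction, seed=0):
    """Draw the same fraction at random from each segment (illustrative helper)."""
    rng = random.Random(seed)

    # Group customer IDs by their segment label.
    by_segment = defaultdict(list)
    for customer_id, segment in customers:
        by_segment[segment].append(customer_id)

    # Sample within each segment, taking at least one customer per segment.
    chosen = {}
    for segment, ids in by_segment.items():
        k = max(1, round(len(ids) * fraction))
        chosen[segment] = rng.sample(ids, k)
    return chosen


# Hypothetical CRM extract: 200 customers, every fourth one has purchased.
customers = [
    (i, "purchased" if i % 4 == 0 else "visited_only") for i in range(1, 201)
]

picked = stratified_sample(customers, fraction=0.10)
for segment, ids in picked.items():
    print(segment, len(ids))
```

Sampling within each segment keeps small segments from being missed, which a single simple random draw cannot guarantee.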

A parameter describes a population, and a statistic describes the same measure for a sample. Three basic measures are the mean, the standard deviation, and the variance (the square of the standard deviation). For a population, these parameters are called mu (μ), sigma (σ), and sigma squared (σ²). For a random sample, the corresponding statistics are called x-bar, s, and s², respectively. The mean and the standard deviation also determine the shape of the normal curve, which is why these measures are so important.

The retail business utilizes these measures primarily when analyzing sales data such as orders and daily sales. For example, the mean shows up in key performance indicators such as Average Order Value, Average Units per Transaction, and Average Sellthrough. Variance and standard deviation are building blocks for more advanced applied analytics and appear in later techniques in the book.

One way to create a random sample is to use a random number generator and then develop an algorithm that uses it. Although this is one simple technique, another way is to work with a statistician to develop a stratified sample, which is much more complicated. There are many ways to generate random samples, and each type of sample answers a different question and gives a different result. Therefore, it is important to be clear at the beginning of a study which type of representative sample is being drawn from the population. Some sampling techniques are harder to develop than others, so at times a statistician should be consulted. The data analysis toolkit in Excel has many types of random number generators, and in Chapter 2 the normal curve will be created using the normal random numbers in the toolkit.

Even with a random sample that should yield a representative estimate, there will still be a sampling error, which is the difference between mu and x-bar, or the mean of the population and the mean of the sample. A population will never be totally captured by sampling, so there will always be some difference between a sample and a population. Random selection keeps this difference from being systematic, and a larger sample generally makes it smaller.
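A small simulation can make sampling error concrete. The sketch below, written in Python for illustration, builds a made-up population of 1,000 values, draws a random sample of 30, and compares the two means. The population values, the seed, and the sign convention (x-bar minus mu) are assumptions for the example.

```python
import random
import statistics

rng = random.Random(7)  # arbitrary seed so the example is repeatable

# Hypothetical population: 1,000 order values centered near 50.
population = [rng.gauss(50, 10) for _ in range(1000)]
mu = statistics.mean(population)

# Random sample of 30 drawn without replacement.
sample = rng.sample(population, 30)
x_bar = statistics.mean(sample)

# Sampling error: how far the sample mean lands from the population mean.
sampling_error = x_bar - mu

print(f"mu = {mu:.2f}, x-bar = {x_bar:.2f}, sampling error = {sampling_error:.2f}")
```

Rerunning the draw with a different seed gives a different x-bar, which is exactly why some sampling error always remains.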

Once a random sample is drawn, it is time to start looking at the data to make sure that it is “clean.” This is also the time to start using Excel and learning its statistical functions. Basic knowledge of Excel is expected for reading this book because, for the most part, only the code is given without much explanation. The following are some types of data problems to watch out for and make decisions about. When complicated issues like these are encountered, a statistician may need to be consulted.

The first and easiest errors to detect are outliers: data points that stand out and do not fit with the rest of the data. Outliers, for the most part, skew your data or may not even make sense. This type of problem could arise from a number typed incorrectly into the computer or from an error in the program. A problem could also arise from including a subject that did not fit the study criteria. In any event, determine what type of problem occurred and, as much as possible, try to understand and then eliminate it by correcting the data, deleting the data point, replacing it with a dummy code (like the number 999), or inserting a value that makes sense (like the mean of the numbers) when necessary. There is a whole field devoted to missing data, so care should be taken in rectifying the situation. Any such change alters the variable, so it should be made for a logical reason and applied consistently for each variable.

Descriptive statistics are used to summarize, organize, and simplify the data, and they make outliers easier to observe. They consist mostly of measures of central tendency and measures of variability, which the book discusses next. The next sections address these terms for a sample, not a population, because whole populations are rarely analyzed.

Measures of Central Tendency (Mean, Median, Mode)

The most widely used descriptive statistics are measures of central tendency, which show where the center of the data lies. These measures include the mean, median, and mode. The mean, often referred to as the average, is the most popular of all statistics. To calculate a mean, sum all the numbers and then divide by the count of the numbers. The mode is the value that occurs most frequently in the data. The median is the middle number of the sample after all the numbers are put in order.

As an example of these descriptive statistics, consider a sample of five numbers: 3, 7, 7, 4, 5.

The mean is the sum of the numbers (3 + 7 + 7 + 4 + 5 = 26) divided by the count (5), or 26/5 = 5.2. The mode is 7 because it is the most frequent number in the sample: it occurs twice, while every other number occurs only once. A sample can have no mode, one mode, or many modes. To determine the median, put all the numbers in order from least to greatest: 3, 4, 5, 7, 7. The middle number here is 5. Because there is an odd number of elements, there is a true middle. For an even number of items, however, take the mean of the two middle numbers. For example, for the six numbers 3, 4, 5, 6, 7, 7, the median is (5 + 6)/2 = 5.5.
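The same three measures can be checked with a few lines of Python. This is an illustrative sketch using Python's standard `statistics` module, not the Excel code the book presents; the variable names are chosen for the example.

```python
import statistics

sample = [3, 7, 7, 4, 5]

mean = statistics.mean(sample)        # sum divided by count: 26 / 5
median = statistics.median(sample)    # middle value after sorting
modes = statistics.multimode(sample)  # list of the most frequent value(s)

# With an even count, the median is the mean of the two middle numbers.
even_median = statistics.median([3, 4, 5, 6, 7, 7])

print(mean, median, modes, even_median)  # 5.2 5 [7] 5.5
```

`multimode` returns a list because a sample can have more than one mode.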

The normal curve is used when the mean, median, and mode are all approximately equal. This is because the data, at this point, is symmetric about the mean: half the data is on one side of the curve’s mean, and half is on the other, and the two halves are mirror images of each other. The median is used mostly when the data is skewed in some way, as housing data is. Housing statistics usually reference the median income level or median price of a home because the very highest prices and incomes are extremely high and would misleadingly skew the average. The mode is used when a whole number is essential, for instance, for the number of children in a household. In this case, a mean of 2.5 children per household does not make sense. Instead, it could be reported that the typical (modal) family has three children.

Measures of Variability (Range, Variance, Standard Deviation)

Unlike measures of central tendency, which show where the center of the data is, measures of variability describe how spread out the data is and are sometimes referred to as the noise, error, or volatility of the data. Variability is a quantitative measure of the differences between values, or the degree to which the values are spread apart from each other. The most common measures of variability are the range, standard deviation, and variance.

Returning to the sample 3,7,7,4,5, the range is the highest (or maximum) number minus the lowest (or minimum) number. The range would be 7 – 3 = 4.

The variance for a sample is more complicated. First, take the difference between each X and the mean of the sample (called x-bar). Then square each of those differences, add the squares together, and divide the total by n – 1, the count minus one. The formula is shown here:

$$S^2 = \frac{\sum (X - \bar{x})^2}{n - 1}$$

The standard deviation for a sample is the square root of variance and is represented by S.

Here are the steps to calculate the variance:
  1. Calculate the mean (explained in the “Measures of Central Tendency” section).
  2. Subtract the mean (x-bar) from every X.
  3. Note that the differences from step 2 always add up to zero, because the numbers center around the mean.
  4. Because of this, square each difference from step 2. The squares cannot be negative, so the variance is always positive or zero.
  5. Sum the squares to get Σ(X – x-bar)² and divide that whole quantity by (n – 1). That number is the variance.
  6. Take the square root of the variance to get the standard deviation, which is always positive or zero.

The variance and standard deviation will be discussed further when the normal and t-distributions are described.
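The steps listed above can be followed one at a time in code. The sketch below is written in Python for illustration (the book itself works in Excel) and uses the five-number sample from the central tendency discussion; the variable names are assumptions for the example.

```python
import math
import statistics

sample = [3, 7, 7, 4, 5]
n = len(sample)

# Step 1: the mean (x-bar).
x_bar = sum(sample) / n

# Step 2: the difference between each X and the mean.
deviations = [x - x_bar for x in sample]

# Step 3: the differences sum to zero (up to floating-point rounding).
deviation_sum = sum(deviations)

# Step 4: square each difference so nothing cancels out.
squared = [d ** 2 for d in deviations]

# Step 5: divide the sum of squares by (n - 1) to get the sample variance.
variance = sum(squared) / (n - 1)

# Step 6: the standard deviation is the square root of the variance.
std_dev = math.sqrt(variance)

print(round(deviation_sum, 10), round(variance, 2), round(std_dev, 3))

# Cross-check against the standard library's sample variance.
assert math.isclose(variance, statistics.variance(sample))
```

Computing the steps by hand and then comparing them with a library function is a simple way to catch arithmetic slips.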

Example of Standard Deviation and Variance

For the sample X = 3, 7, 7, 4, 5, whose mean was calculated in the earlier “Measures of Central Tendency” section, the preceding steps give the following:

X     Mean   X – mean   (X – mean)²
3     5.2    –2.2        4.84
7     5.2     1.8        3.24
7     5.2     1.8        3.24
4     5.2    –1.2        1.44
5     5.2    –0.2        0.04
____________________________________________
26    ---     0.0       12.80   SUMS

The variance for a sample is as follows:

S² = Σ(X – mean)²/(n – 1) = 12.80/(5 – 1) = 3.20

The standard deviation for a sample is as follows:

S = sqrt(3.20) = 1.789

Squaring the standard deviation recovers the variance, which serves as a check:

S² = S × S = 1.789 × 1.789 = 3.20 (rounding errors can occur)

These three measures of variability (range, standard deviation, and variance) are referred to as noise, volatility, or error, and they can often point out potential problems with the data when there are outliers. The range is used when you want to know the distance between the highest number and the lowest number. For instance, if a relay race is timed and the participants want to know the difference in time between the fastest runner and the slowest runner, the range would be used. When the range appears too big or too small, the individual numbers near the minimum and the maximum should be examined to determine why this is occurring. Is it an outlier, or a subject that should not have been included to begin with?

The standard deviation and variance (whose formulas were shown previously) are related. The variance is more theoretical and, for the most part, used in formulas. The standard deviation is descriptive and shows, based on an approximately normal curve (a symmetric curve where the mean, median, and mode are approximately the same), how much dispersion is in the data. For example, suppose a normally distributed population has a mean of 50 and a standard deviation of 10. The standard deviation is subtracted from the mean on the left and added to the mean on the right. One standard deviation from the mean is 60 on the positive (right) side and 40 on the negative (left) side of the curve. Two standard deviations from the mean are 70 on the positive (right) side and 30 on the negative (left) side of the curve. This book develops this concept in more detail in the next chapter on the normal curve. Thus, the standard deviation measures the distance from the center of the normal curve in either direction, because the curve is symmetric about the mean. The variance is used in formulas to analyze the noise or variability. As explained earlier, the variance is the square of the standard deviation.
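The arithmetic for the points one and two standard deviations from the mean can be written as a tiny helper. This Python sketch is only an illustration; the function name `band` is an assumption, and the mean of 50 and standard deviation of 10 come from the example above.

```python
def band(mean, std_dev, k):
    """Return the points k standard deviations below and above the mean."""
    return (mean - k * std_dev, mean + k * std_dev)


mean, std_dev = 50, 10

one_sd = band(mean, std_dev, 1)  # (40, 60)
two_sd = band(mean, std_dev, 2)  # (30, 70)

print(one_sd, two_sd)
```

The left and right endpoints are the same distance from the mean because the normal curve is symmetric.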

When the standard deviation is calculated, the first step is to calculate the difference between each number X and the mean. The sum of these differences will always be zero, because the positive and negative differences cancel out around the mean. The next step is to square each difference so that the sum is not zero and, therefore, the variance and the standard deviation can never be negative.
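A short derivation shows why the differences from the mean always sum to zero. It uses only the definition of the sample mean, x-bar = ΣX / n; the index i and the subscripted X are notation added here for clarity.

```latex
\sum_{i=1}^{n} (X_i - \bar{x})
  = \sum_{i=1}^{n} X_i - n\bar{x}
  = n\bar{x} - n\bar{x}
  = 0
```

For the sample 3, 7, 7, 4, 5 with mean 5.2, the differences are –2.2, 1.8, 1.8, –1.2, and –0.2, which add up to zero.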

These concepts of standard deviation and variance will be elaborated on in the next sections with more examples and may be clearer to you at that time.

Computational Example

Now it is time to start computing in Excel. Figure 1-1 shows the first Excel code in this book, which calculates the measures of central tendency and the measures of variability. It uses the same data as the earlier discussion of these measures. First, the output in Excel will be shown, and then the code will be displayed.
Figure 1-1. Code and descriptive statistics using Excel

Cleaning the Data Using Descriptive Statistics

It is important to input the data without any errors and then to make sure the data is “clean” using some descriptive analysis. Here are some ways to identify whether there are any errors:
  • When the mean, median, and mode are all the same number, or at least almost the same number, the data may be approximately normal. This is what you want to see. If the data is normally distributed, then standard normal statistics can be used.

  • Outliers skew the data and move the mean of the distribution around and thereby may change the results of the statistics. Charts often point out outliers more easily than the raw numbers do. In addition, the minimum data point, maximum data point, and range are used to identify egregious outliers, which may skew the data. Moreover, the researcher needs to identify what is causing the error and where it is coming from: an input data error, a computer programming mistake, or a real-life issue (such as a customer who should never have been included in the dataset to begin with).

  • The datasets either have to be from a normal distribution or have 30 or more data points to be large enough to explain the results. Use n (or count) to make sure there are enough data points in the combined data as well as in each separate group that the researcher decides to analyze.

  • It is best to analyze groups that have approximately the same number of data points in each. The data is hard to analyze when there are major differences in the counts between groups.

  • Whether the company you are working at has a small or large dataset, the same steps need to be used to clean the data.

Note

Small and large enough samples will be discussed in Chapter 6 when discussing the charts. When there is a large discrepancy in size between the groups being studied, the data could be skewed and affect the results of the statistical test. When the sample sizes are grossly different, a statistician should be consulted.
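The checks in this section can be gathered into a first cleaning pass. The Python sketch below is an illustration under stated assumptions: the daily order counts are made up, 999 mimics a dummy code left in the data, and the rule of flagging values more than two sample standard deviations from the mean is an illustrative threshold, not a rule given in this book.

```python
import statistics


def describe(values):
    """Collect the summary numbers used for a first cleaning pass."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "modes": statistics.multimode(values),
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }


def flag_outliers(values, k=2.0):
    """Flag values more than k sample standard deviations from the mean.

    The threshold k is an illustrative assumption, not a fixed rule.
    """
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > k * stdev]


# Hypothetical daily order counts; 999 mimics a dummy code left in the data.
orders = [12, 15, 14, 13, 16, 15, 14, 999, 13, 15]

summary = describe(orders)
suspects = flag_outliers(orders)

print(summary)
print("Values to investigate:", suspects)
```

In this made-up data, the mean lands far above the median and the range is very large, which are the warning signs described in the bullets above. A flagged value should still be investigated before it is corrected or removed.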

Summary

As emphasized in the introduction of this chapter, the business questions of the company need to be addressed at the start of a promotion or investigation, and they need to be understood clearly. This ensures that you have the right variables in the proper form to help answer your research question. Descriptive analysis uses the basic measures of central tendency and measures of variability to clean, organize, and arrange the data. This step is one of the most tedious but also one of the most important. Without clean and well-organized data, no statistics can be accurately analyzed or understood, and no “true” results will be conveyed. Once the data is cleaned, the next steps are to conduct statistical analysis and analytics. If possible, pictures and charts should be used to emphasize the most important findings. These topics will be discussed later in the book.
