Basic Characteristics of Variables

Introduction to Variable Characteristics

Once you have data, there are various things you may want to analyze about it. One vital reason for wanting to use numerical data is that it is easier to analyze, and there is more you can do with it.
For example, Figure 3.2 shows performance appraisal scores for the period 2003 to 2006, with each employee scored from 1 to 10.
Figure 3.2 Sample spreadsheet of performance appraisal data
Imagine you want to analyze the 2006 performance appraisal data from Figure 3.2. You could:
  • Analyze how many of each category of scores there are (e.g. how many 6s).
  • Ask what the score of the average employee is.
  • Ask how much spread there is in performance between the different employees.
  • Compare relationships between different bits of data.
Take a look at the 2006 scores from Figure 3.2. Looking down the column, we see that the data ranges from 2 (the lowest in the list) to 10 (the highest). Ordered from lowest to highest, it is:
2, 3, 5, 5, 6, 6, 6, 7, 8, 10
We see three major things about the data already:
  • It is data that seems to run from low to high with a fair number of options.
  • It has a center; that is where the middle of the data is situated. With the naked eye you would probably place the center at about 6.
  • It has a spread, in other words it is spread out on either side of the center.
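The questions above can be sketched in a few lines of Python. This is only an illustration using the ten sorted 2006 scores listed above; the variable names are assumptions, and the mean and median are used here simply as two common ways of locating the center (Chapter 7 covers such measures properly).

```python
from collections import Counter
from statistics import mean, median

# The ten 2006 appraisal scores listed above.
scores_2006 = [2, 3, 5, 5, 6, 6, 6, 7, 8, 10]

# How many of each score there are (e.g. how many 6s).
score_counts = Counter(scores_2006)

# Two candidate descriptions of the center.
center_mean = mean(scores_2006)
center_median = median(scores_2006)

# The crudest description of spread: the lowest and highest score.
lowest, highest = min(scores_2006), max(scores_2006)

print("Employees scoring 6:", score_counts[6])
print("Mean:", center_mean, "Median:", center_median)
print("Lowest:", lowest, "Highest:", highest)
```

The mean works out to 5.8 and the median to 6, which agrees with placing the center at about 6 by eye.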
There is more we could ask: what kind of data is it (compared, say, to an appraisal system where employees are ranked only on a system of “Below Average,” “Acceptable” and “Above Average”)?
There are therefore multiple aspects or characteristics of numerical data that we need to understand – these characteristics and our understanding of them have a fundamental impact on our ability to analyze them as well as our understanding of the analysis.
As we have seen, each variable is measured across multiple observations, such as a survey question measured across many employees. Because the variable is measured across many observations, which can differ in their response to the variable, there is a range of measured responses to each variable. This brings up three very important characteristics of variables:
  • Type of variable: This depends on what was being measured, and fundamentally affects the type of data analysis you can use the variable in.
  • Centrality: This is the most representative response on the variable across the whole sample.
  • Spread: This represents the range of responses across the sample.
Let us explore these three characteristics of data more thoroughly.

Variable Characteristic 1: Type

Basics of Variable Types

Figure 3.3 represents the four types of variables on a scale that increases from left to right in level of statistical information. As seen there, the four types of variables are categorical, ordinal, interval and ratio data, although we can often place interval and ratio together and call them “continuous data.”
Figure 3.3 Types of variables
The following sections expand on the differences in the types of variables.

Ratio Data

Ratio data has a natural zero point and can take any value upwards from it. For example, take the age of your employees:
  • Zero (0) days old is an absolute zero.
  • The difference between someone 100 days old vs. 1000 days old is meaningful. This is a characteristic of ratio data: the difference between two scores is a meaningful piece of data.
  • In addition, the ratio between any two scores in such data is meaningful. For example, 30 years old is half as old as 60 years old. This is another characteristic of ratio data: the ratios between scores are meaningful.

Interval Data

This form of data is similar to ratio data in that technically one can have any value in a range, but there is no absolute zero.
  • In interval data, the magnitude of difference between two data points is mathematically meaningful, but since there is no natural zero point, ratios do not make sense.
  • For example, take time. The difference between 1 January 2009 and 30 September 2007 is 459 days, and that holds whatever zero point you count from. But what is the zero point? The Big Bang? The birth of Jesus? SAS uses 1 January 1960 as the zero point for time; Microsoft Excel uses 1 January 1900. These are arbitrary zero points. In such cases, the ratio between scores is meaningless: 1 January 2009 divided by 30 September 2007 does not make mathematical sense!
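A short Python sketch can make the interval-versus-ratio distinction concrete. It assumes, for illustration only, that each product's epoch date is simply treated as day 0 (a simplification of how SAS and Excel actually number days); the function and variable names are hypothetical.

```python
from datetime import date

start = date(2007, 9, 30)
end = date(2009, 1, 1)

# Differences between dates are meaningful and need no zero point.
gap_days = (end - start).days


def days_since(epoch, d):
    """Express a date as a day count from an arbitrary zero point."""
    return (d - epoch).days


# Two arbitrary zero points mentioned in the text (treated here as day 0).
sas_epoch = date(1960, 1, 1)
excel_epoch = date(1900, 1, 1)

# The difference is the same whichever zero point is used...
gap_sas = days_since(sas_epoch, end) - days_since(sas_epoch, start)
gap_excel = days_since(excel_epoch, end) - days_since(excel_epoch, start)

# ...but the "ratio" of the two dates changes with the zero point.
ratio_sas = days_since(sas_epoch, end) / days_since(sas_epoch, start)
ratio_excel = days_since(excel_epoch, end) / days_since(excel_epoch, start)

# Age, by contrast, is ratio data: it has a natural zero, so this is meaningful.
age_ratio = 30 / 60

print("Gap in days:", gap_days, gap_sas, gap_excel)
print("Ratio (1960 zero):", ratio_sas)
print("Ratio (1900 zero):", ratio_excel)
print("30 years relative to 60 years:", age_ratio)
```

The gap is 459 days under both zero points, while the two ratios disagree, which is why ratios of interval data carry no information.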

Ordinal Data

This type of data indicates order (i.e. rank) in a series, but the difference between scores is not mathematically meaningful:
  • For instance, say you ask someone to tick age groups in a survey where 1=0-16 2=17-25 3=26-35 4=36-45 etc.
  • A score of 1 (age 0-16) indicates a lower age than 2 (17-25), but the difference between a score of 1 and 2 in this example is not mathematically meaningful!
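To see why differences between ordinal scores are not meaningful, the age-group coding above can be sketched as follows. The helper name `age_group` and the example ages are assumptions made for illustration; the group boundaries follow the survey coding in the text.

```python
from bisect import bisect_left

# Upper bounds of the groups in the text: 1=0-16, 2=17-25, 3=26-35, 4=36-45.
# Any age above 45 falls into the next group ("etc." in the text).
UPPER_BOUNDS = [16, 25, 35, 45]


def age_group(age):
    """Return the ordinal group code for an age in years."""
    return bisect_left(UPPER_BOUNDS, age) + 1


# Two pairs of people whose group codes differ by exactly 1.
close_pair = (16, 17)  # one year apart
far_pair = (0, 25)     # twenty-five years apart

for younger, older in (close_pair, far_pair):
    code_gap = age_group(older) - age_group(younger)
    print(f"ages {younger} and {older}: "
          f"codes {age_group(younger)} and {age_group(older)}, "
          f"code gap {code_gap}, real gap {older - younger} years")
```

Both pairs show a code gap of 1, yet the real age gaps are 1 year and 25 years. The codes preserve order but say nothing reliable about distance.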

Categorical Data

Here, a data point is a number that represents membership in some category:
  • For example, you could be asking people to indicate their marital status in a survey.
  • You could give a list of possible statuses, each one of which would be scored. In one configuration, you could score 0 = Married, 1 = Single, 2 = Divorced. But these numbers are arbitrary, and could be in any order!
  • For example, an alternate scoring system could be 0 = Single, 1 = Divorced, 2 = Married. Do you see that the numbers are not numerical; they are merely tags? The actual data has no numerical value and could just as well be words.
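The point that categorical codes are merely tags can be illustrated with a small sketch. The survey responses below are hypothetical, invented purely for the example; the two scoring systems are the ones described above.

```python
from collections import Counter
from statistics import mean

# Hypothetical survey responses, made up for illustration.
responses = ["Married", "Single", "Single", "Divorced", "Married", "Married"]

# The two arbitrary scoring systems from the text.
scheme_a = {"Married": 0, "Single": 1, "Divorced": 2}
scheme_b = {"Single": 0, "Divorced": 1, "Married": 2}

codes_a = [scheme_a[r] for r in responses]
codes_b = [scheme_b[r] for r in responses]

# Arithmetic on the codes depends entirely on the arbitrary scheme...
mean_a = mean(codes_a)
mean_b = mean(codes_b)

# ...whereas counting category membership does not.
category_counts = Counter(responses)

print("Mean of codes, scheme A:", mean_a)
print("Mean of codes, scheme B:", mean_b)
print("Counts per category:", dict(category_counts))
```

The same six responses produce different "averages" under the two schemes, while the category counts stay the same. For categorical data, counts and proportions are what carry the information.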

The Importance of Variable Type

There are good reasons for covering this material. When it comes to statistical analysis of this data, interval and ratio (i.e. continuous) data have the best statistical qualities. As we will see later, when it comes to relating sets of data through techniques such as regression, the type of the main (dependent) variable you are trying to analyze fundamentally affects the type of statistical analysis you do.
For instance, in the case of the Chapter 1 case study on Accu-Phi, the main focal/dependent variable is sales. This is a ratio variable (it runs on a fine scale from low to high; differences between sales levels are meaningful; it has a natural zero) and therefore continuous. Because it is continuous, certain types of statistics can be employed on it.
Generally, when measuring variables, a good guide is to try measuring them as interval or ratio data. More on this is discussed in the next chapter.

Variable Characteristic 2: Centrality

As mentioned earlier, the centrality of a variable is its most representative data point: the score that most of the sample tends towards. The average, or mean, is one specific centrality statistic, but you cannot use the average for all data; the appropriate centrality measure depends on the type of variable. Chapter 7 discusses the different types of centrality statistics.
A quick note: measures of centrality are by far the most used and important measures of practical statistics, especially in business. Market research, for instance, is principally concerned with average levels of variables such as spending or customer attitudes, especially between geographical or demographic segments. Financial research might look at average returns in various investment portfolios. HR research wants to know about variables such as average time to fill a vacancy.
However, measures of centrality are only a start to really understanding data. The spread of the data is also crucial.

Variable Characteristic 3: Spread

Remember that different observations vary in their scores on a given variable. If you measure age across a sample of multiple people, their ages will differ. Spread expresses this crucial characteristic of a variable, namely that there is a range of data on either side of the middle. The larger the spread of the data away from the middle, the less well the central score represents the whole dataset and the more the observations differ from one another.
As an illustration, take a look at the two different sets of data in Figure 3.4 below.
Figure 3.4 Examples of different spreads in data
  • In the top set of data, data points of the same value have been stacked on top of each other to illustrate density – see that three people had scores of 6. Rough spread is indicated by the dotted arrows.
  • The data at the bottom is a dataset with less spread (the dotted arrows are smaller). Because we have taken away the 2 and 10, there is less data far away from the center point. Therefore, on average the data is more clustered around the center point and less spread out.
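The comparison can also be put into numbers. The sketch below assumes that the top dataset in the figure is the ten 2006 appraisal scores and that the bottom dataset is the same list with the 2 and the 10 removed, as the description suggests. It uses the range and the sample standard deviation (covered in Chapter 7) as two possible measures of spread.

```python
from statistics import stdev

# Assumed to match the top dataset in the figure (the 2006 scores).
wide = [2, 3, 5, 5, 6, 6, 6, 7, 8, 10]

# The same data with the 2 and the 10 taken away.
narrow = [score for score in wide if score not in (2, 10)]


def value_range(data):
    """Distance between the lowest and highest score."""
    return max(data) - min(data)


range_wide, range_narrow = value_range(wide), value_range(narrow)
sd_wide, sd_narrow = stdev(wide), stdev(narrow)

print("Range:", range_wide, "vs", range_narrow)
print("Standard deviation:", round(sd_wide, 2), "vs", round(sd_narrow, 2))
```

The range falls from 8 to 5 and the standard deviation shrinks as well, matching the shorter dotted arrows described for the bottom dataset.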
Data that has a natural order from low to high (interval, ratio, and even ordinal data) inevitably has a range: that is, such variables range from some low to some high. The ages of workers in the company, for instance, might range from an absolute low of 17 to a high of 67. Categorical data also has spread, but this is harder to think about on a low-to-high basis; for such data you are thinking about the relative distribution of observations between categories.
Minima and maxima of sets of data can be useful to know as an absolute measure of spread, because they tell you the absolute lowest score and the absolute highest. They do not, however, tell you anything about where in the range the center of the data lies, or where the majority of the data lies. The employee age endpoints of 17 and 67 might be extreme outliers (there might be very few employees with ages as low as 17 or as high as 67). Statistical measures of spread that capture the majority of observations without also taking in the extreme outliers at the far ends of the range are better measures. There are several such spread-based statistical measures, such as the standard deviation and inter-quartile range, as discussed in Chapter 7.
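As a rough illustration of why the endpoints can mislead, the sketch below compares the full range with the inter-quartile range. The ten ages are hypothetical, invented so that 17 and 67 are isolated outliers, and the quartiles use the default method of Python's `statistics.quantiles`, which is only one of several conventions.

```python
from statistics import quantiles

# Hypothetical employee ages: most are in their thirties,
# with single outliers at 17 and 67.
ages = [17, 29, 31, 33, 34, 36, 38, 40, 42, 67]

# Absolute spread: lowest to highest.
age_range = max(ages) - min(ages)

# Quartile cut points; the inter-quartile range spans the middle half.
q1, q2, q3 = quantiles(ages, n=4)
iqr = q3 - q1

print("Range:", age_range)
print("Inter-quartile range:", iqr)
```

In this made-up sample the range is driven entirely by the two outliers, while the inter-quartile range reflects where most employees actually sit.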
Spread might be the most important thing in statistics. I encourage you to read the spread sections later in the book carefully until you understand them. Later, we will see why this concept is so critical.

Choosing the Right Variables and Measures

Business statistics succeeds or fails depending on whether you pick the right data (observations and variables) to study, and whether you measure them in a valid and reliable way. Chapter 4 discusses these issues in more depth, as well as the broad challenges around actually gathering, capturing and cleaning datasets.
Last updated: April 18, 2017