Search in book...
Toggle Font Controls
Create new playlist

Name your new playlist

Playlist description (optional)
Sign In

Email address

Password

Forgot Password?

or

Continue with Facebook

Continue with Google
Sign Up

Full Name

Email address

Confirm Email Address

Password

or

Continue with Facebook

Continue with Google

Chapter 8. Running Descriptive Statistics in Access

Descriptive statistics allow you to present large amounts of data in quantitative summaries that are simple to understand. When you sum data, count data, and average data, you are producing descriptive statistics. It is important to note that descriptive statistics are used only to profile a dataset and enable comparisons that can be used in other analyses. This is different from inferential statistics, where you infer conclusions that extend beyond the scope of the data. To help solidify the difference between descriptive and inferential statistics, consider a customer survey. Descriptive statistics summarize the survey results for all customers and describe the data in understandable metrics, while inferential statistics infer conclusions such as customer loyalty based on the observed differences between groups of customers.

When it comes to inferential statistics, Excel is better suited to handle these types of analyses than Access. Why? First, Excel comes with a plethora of built-in functions and tools that make it easy to perform inferential statistics; tools that Access simply does not have. Secondarily, inferential statistics are usually performed on small subsets of data that can flexibly be analyzed and presented by Excel. Running descriptive statistics, on the other hand, is quite practical in Access. In fact, running descriptive statistics in Access versus Excel is often the smartest option due to the structure and volume of the dataset.

Tip

The examples shown in this chapter can be found in the sample database for this book. The sample database for this book can be found on Wrox.com.

Basic Descriptive Statistics

This section discusses some of the basic tasks you can perform by using descriptive statistics, including:

Running descriptive statistics with aggregate queries
Ranking records in a dataset
Determining the mode and median of a dataset
Creating random samplings from a dataset

Running Descriptive Statistics with Aggregate Queries

At this point in the book, you have run many Access queries, some of which have been aggregate queries. Little did you know that when you ran those aggregate queries, you were actually creating descriptive statistics. It's true. The simplest descriptive statistics can be generated using an aggregate query. To demonstrate this point, build the query shown in Figure 8-1.

Figure 8.1. Running this aggregate query provides a useful set of descriptive statistics.

Similar to the descriptive statistics functionality found in Excel, the result of this query, shown in Figure 8-2, provides key statistical metrics for the entire dataset.

Figure 8.2. Key statistical metrics for the entire dataset.

You can easily add layers to your descriptive statistics. In Figure 8-3, you are adding the Branch_Number field to your query. This gives you key statistical metrics for each branch.

Figure 8.3. Add the Branch_Number field to your query to add another dimension to your analysis.

As you can see in Figure 8-4, you can now compare the descriptive statistics across branches to measure how they perform against each other.

Figure 8.4. You have a one shot view of the descriptive statistics for each branch.

Determining Rank, Mode, and Median

Ranking the records in your dataset, getting the mode of a dataset, and getting the median of a dataset are all tasks that data analysts need to perform from time to time. Unfortunately, Access does not provide built-in functionality to perform these tasks easily. This means you have to come up with a way to carry out these descriptive statistics. In this section, you learn some of the techniques you can use to determine rank, mode, and median.

Ranking the Records in Your Dataset

You will undoubtedly encounter scenarios where you have to rank the records in your dataset based on a specific metric such as revenue. A record's rank is not only useful in presenting data; it is also a key variable when calculating advanced descriptive statistics such as median, percentile, and quartile.

The easiest way to determine a record's ranking within a dataset is by using a correlated subquery. The query shown in Figure 8-5 demonstrates how a rank is created using a subquery.

Figure 8.5. This query ranks employees by revenue.

Take a moment to examine the subquery that generates the rank.

(SELECT Count(*)FROM RepSummary AS M1 WHERE [Rev]>[RepSummary].[Rev])+1

This correlated subquery returns the total count of records from the M1 table (this is the RepSummary table with an alias of M1), where the Rev field in the M1 table is greater than the Rev field in the RepSummary table. The value returned by the subquery is then increased by one. Why increase the value by one? If you don't, the record with the highest value will return 0 because zero records are greater than the record with the highest value. The result would be that your ranking starts with 0 instead of 1. Adding one effectively ensures that your ranking starts with 1.

Note

Because this is a correlated subquery, this subquery is evaluated for every record in your dataset, giving you a different rank value for each record. Correlated subqueries are covered in detail in Chapter 8. In Appendix B, you'll find a primer on using SQL syntax.

Figure 8-6 shows the result.

Tip

This technique is also useful when you want to create an autonumber field within a query.

Getting the Mode of a Dataset

The mode of a dataset is the number that appears the most often in a set of numbers. For instance, the mode for 4, 5, 5, 6, 7, 5, 3, 4 is 5.

Figure 8.6. You have created a Rank column for your dataset.

Unlike Excel, Access does not have a built-in Mode function, so you must create your own method of determining the mode of a dataset. Although there are various ways to get the mode of a dataset, one of the easiest is to use a query to count the occurrences of a certain data item, and then filter for the highest count. To demonstrate this method, build the query shown in Figure 8-7.

This query groups by the Rev field and then counts the occurrences of each number in Rev field. The query is sorted in descending order by Rev.

Figure 8.7. This query groups by the Rev field and then counts the occurrences of each number in Rev field. The query is sorted in descending order by Rev.

The results, shown in Figure 8-8, do not seem very helpful, but if you turn this into a top values query, returning only the top one record, you will effectively get the mode.

Change the Top Values property to 1, as shown in Figure 8-9, and you will get one record with the highest count.

As you can see in Figure 8-10, you now have only one Rev figure: the one that occurs most often. This is your mode.

Note

Keep in mind that in the event of a tie, a top values query shows all records. This effectively gives you more than one mode. You must make a manual determination which mode to use.

Figure 8.8. Almost there. Turn this into a top values query and you'll have your Mode.

Figure 8.9. Set the Top Values property to 1.

Figure 8.10. This is your mode.

Getting the Median of a Dataset

The median of a dataset is the middle number in the dataset. In other words, half of the numbers have values greater than the median, and half have values less than the median. For instance, the median number in 3, 4, 5, 6, 7, 8, 9 is 6 because 6 is the middle number of the dataset.

Tip

Why can't you just calculate an average and be done with it? Sometimes, calculating an average on a dataset that contains outliers can dramatically skew your analysis. For example, if you were to calculate an average on the numbers, 32, 34, 35, 37, and 89, you would get an answer of 45.4. The problem is that 45.4 does not accurately represent the central tendency of this sampling of numbers. Using median on this sample makes more sense. The median in this case would be 35, which is more representative of what's going on in this data.

Access does not have a built-in Median function, so you have to create your own method of determining the median of a dataset. An easy way to get the median is to build a query in two steps. The first step is to create a query that sorts and ranks your records. The query shown in Figure 8-11 sorts and ranks the records in the RepSummary table.

Figure 8.11. The first step in finding the median of a dataset is to assign a rank to each record.

The next step is to identify the middle-most record in your dataset by counting the total number of records in the dataset and then dividing that number by two. This gives you a middle value. The idea is that because the records are now sorted and ranked, the record that has the same rank as the middle value is the median. Figure 8-12 shows the subquery that returns a middle value for the dataset. Note that the value is wrapped in an Int function to strip out the fractional portion of the number.

Figure 8.12. The Middle Value subquery counts all the records in the dataset and then divides that number by 2.

As you can see in Figure 8-13, the middle value is 336. You can go down to record 336 to see the median.

Figure 8.13. Go down to record 336 to get the median value of the dataset.

If you want to return only the median value, simply use the subquery as a criterion for the Rank field, as shown in Figure 8-14.

Figure 8.14. Using the subquery as a criterion for the Rank field will ensure that only the median value is returned.

Pulling a Random Sampling from Your Dataset

Although the creation of a random sample of data does not necessarily fall into the category of descriptive statistics, a random sampling is often the basis for statistical analysis.

There are many ways to create a random sampling of data in Access, but one of the easiest is to use the Rnd function within a top values query. The Rnd function returns a random number based on an initial value. The idea is to build an expression that applies the Rnd function to a field that contains numbers, and then limit the records returned by setting the Top Values property of the query.

To demonstrate this method, follow these steps:

Start a query in Design view on the TransactionMaster table.
Create a Random ID field, as shown in Figure 8-15, and then sort the field (either ascending or descending will work).
Note
The Rnd function does not work with fields that contain text or Null values. Strangely enough, though, the Rnd function works with fields that contain all numerical values even if the field is formatted as a Text type field.
If your table is made up of fields that only contain text, you can add an Autonumber field to use with the Rnd function. Another option is to pass the field containing text through the Len function, and then use that expression in your Rnd function. For example: Rnd(Len([Mytext])).
Figure 8.15. Start by creating a Random ID field using the Rnd function with the Customer_Number field.
Change the Top Values property of the query to the number of random records you want returned. The scenario shown in Figure 8-16 limits this dataset to 1,000 records.
Set the Show row for the Random ID field to false and add the fields you want to see in your dataset.
Run the query and you will have a completely random sampling of data.

Warning

Re-running the query, switching the view state, or sorting the dataset, results in a different set of random records. If you want to perform extensive analysis on an established set of random records that does not change, you must run this query as a Make-Table query in order to create a hard table.

Figure 8.16. Limit the number of records returned by setting the Top Values property of the query.

Figure 8.17. Running this query produces a sample 1,000 random records.

Advanced Descriptive Statistics

When working with descriptive statistics, a little knowledge goes a long way. Indeed, basic statistical analyses often lead to more advanced statistical analyses. In this section, you build on the fundamentals you have just learned to create advanced descriptive statistics.

Calculating Percentile Ranking

A percentile rank indicates the standing of a particular score relative to the normal group standard. Percentiles are most notably used in determining performance on standardized tests. If a child scores in the 90^th percentile on a standardized test, this means that his score is higher than 90 percent of the other children taking the test. Another way to look at it is to say that his score is in the top 10 percent of all the children taking the test. Percentiles are often used in data analysis as a method of measuring a subject's performance in relation to the group as a whole—for instance, determining the percentile ranking for each employee based on annual revenue.

Calculating a percentile ranking for a dataset is simply a mathematical operation. The formula for a percentile rank is (Record Count–Rank)/Record Count. The trick is to getting all the variables needed for this mathematical operation.

To start, follow these steps:

Build the query you see in Figure 8-18. This query will start by ranking each employee by annual revenue. Be sure to give you new field an alias of "Rank."
Add a field that counts all the records in your dataset. As you can see in Figure 8-19, you are using a subquery to do this. Be sure to give your new field an alias of "RCount."
Create a calculated field with the expression (RCount–Rank)/RCount. At this point, your query should look like the one shown in Figure 8-20.
Running the query give you the results shown in Figure 8-21.

Again, the resulting dataset enables you to measure each employee's performance in relation to the group as a whole. For example, the employee ranked sixth in the dataset is in the 99^th percentile, meaning this employee earned more revenue than 99 percent of other employees.

Figure 8.18. Start with a query that ranks employees by revenue.

Figure 8.19. Add a field that returns a total dataset count.

Figure 8.20. The final step is to create a calculated field that gives you the percentile rank for each record.

Figure 8.21. You've successfully calculated the percentile rank for each employee.

Determining the Quartile Standing of a Record

A quartile is a statistical division of a dataset into four equal groups, with each group making up 25 percent of the dataset. The top 25 percent of a collection is considered to be the first quartile, whereas the bottom 25 percent is considered the fourth quartile. Quartile standings typically are used for the purposes of separating data into logical groupings that can be compared and analyzed individually. For example, if you want to establish a minimum performance standard around monthly revenue, you could set the minimum to equal the average revenue for employees in the second quartile. This ensures you have a minimum performance standard that at least 50 percent of your employees have historically achieved or exceeded.

Establishing the quartile for each record in a dataset does not involve a mathematical operation; rather, it is a question of comparison. The idea is to compare each record's rank value to the quartile benchmarks for the dataset. What are quartile benchmarks? Imagine that your dataset contains 100 records. Dividing 100 by 4 would give you the first quartile benchmark (25). This means that any record with a rank of 25 or less is in the first quartile. To get the second quartile benchmark, you would calculate 100/4*2. To get the third, you would calculate 100/4*3 and so on.

Given that information, you know right away that you need to rank the records in your dataset and count the records in your dataset. Follow these steps:

Start by building the query shown in Figure 8-22. Build the Rank field the same way you did in Figure 8-18.
Build the RCount field the same way you did in Figure 8-19.
Once you create the Rank and RCount fields in your query, you can use these fields in a Switch function that tag each record with the appropriate quartile standing. Take a moment and look at the Switch function you will be using.

Switch([Rank]<=[RCount]/4*1,"1st",[Rank]<=[RCount]/4*2,"2nd",
[Rank]<= [RCount]/4*3,"3rd",True,"4th")

Start by creating a field named Rank that ranks each employee by revenue and a field named RCount that counts the total records in the dataset.

Figure 8.22. Start by creating a field named Rank that ranks each employee by revenue and a field named RCount that counts the total records in the dataset.

This Switch function is going through four conditions, comparing each record's rank value to the quartile benchmarks for the dataset.

Note

For more information on the Switch function, see Chapter 6.

Figure 8-23 demonstrates how this Switch function fits into the query. Note that you are using an alias of Quartile here.

Figure 8.23. Create the quartile tags using the Switch function.

As you can see in Figure 8-24, you can sort the resulting dataset on any field without compromising your quartile standing tags.

Figure 8.24. Your final dataset can be sorted any way without the danger of losing your quartile tags.

Creating a Frequency Distribution

A frequency distribution is a special kind of analysis that categorizes data based on the count of occurrences where a variable assumes a specified value attribute. Figure 8-25 illustrates a frequency distribution created by using the Partition function.

With this frequency distribution, you are clustering employees by the range of revenue dollars they fall in. For instance, 183 employees fall into the 500: 5999 grouping, meaning that 183 employees earn between 500 and 5,999 revenue dollars per employee. Although there are several ways to get the results you see here, the easiest way to build a frequency distribution is to use the Partition function.

Figure 8.25. This frequency distribution was created by using the Partition function.

ABOUT THE PARTITION FUNCTION

The Partition function identifies the range that a specific number falls into, indicating where the number occurs in a calculated series of ranges. The Partition function requires the following four arguments:

Partition(Number, Range Start, Range Stop, Interval)

Number (required): The number you are evaluating. In a query environment, you typically use the name of a field to specify that you are evaluating all the row values of that field.
Range Start (required): A whole number that is to be the start of the overall range of numbers. Note that this number cannot be less than zero.
Range Stop (required): A whole number that is to be the end of the overall range of numbers. Note that this number cannot be equal to or less than the Range Start.
Interval (required): A whole number that is to be the span of each range in the series from Range Start to Range Stop. Note that this number cannot be less than one.

To create the frequency distribution you saw in Figure 8-25, build the query shown in Figure 8-26. As you can see in this query, you are using a Partition function to specify that you want to evaluate the Revenue field, start the series range at 500, end the series range at 100,000, and set the range intervals to 5,000.

Figure 8.26. This simple query creates the frequency distribution you see in Figure 8-25.

You can also create a frequency distribution by group by adding a Group By field to your query. Figure 8-27 demonstrates this by adding the Branch_Number field.

The result is a dataset (see Figure 8-28) that contains a separate frequency distribution for each branch, detailing the count of employees in each revenue distribution range.

Figure 8.27. This query creates a separate frequency distribution for each branch number in your dataset.

Figure 8.28. You have successfully created multiple frequency distributions with one query.

TRICKS OF THE TRADE: CREATING A HISTOGRAM CHART IN ACCESS

A histogram chart is a graphic representation of a frequency distribution. You can use these types of charts to easily pick out anomalies in a data collection, by following these steps:

Start your histogram chart by building the query shown in Figure 8-29.
Run the query to display the results of the frequency distribution operation (be sure to run the query). After you run the query, select View
Within a few seconds, Access runs your query and returns the results to a pivot chart object.
Figure 8.29. Build a query using the Partition function to create a frequency distribution.
Figure 8.30. Switch to PivotChart View.
Drag thze Branch Number field to the section of the pivot chart that reads Drop Filter Fields Here.
Drag the Dollars field to the section of the pivot chart that reads Drop Category Fields Here.
Drag the Employees field to gray plot area in center of the chart.
The result is a histogram similar to the one illustrated here in Figure 8-31.
As if that isn't impressive enough, remember you have given yourself the ability to filter by branch number. To filter out one branch, click the drop-down arrow next to the Branch_Number field, shown in Figure 8-32, and remove the check from the All checkbox.
Place a check in the checkbox for the branch you want to show, and then click OK.

The PivotChart in the Figure 8-33 presents the histogram chart for branch 301316.

Figure 8.31. Your histogram chart is complete!

Figure 8.32. To filter by branch, click the drop-down arrow next to the Branch_Number field.

Figure 8.33. You have filtered out all branches, showing only branch 301316.

Summary

Descriptive statistics allow you to profile a dataset and enable comparisons that you can use in other analyses. With descriptive statistics, you can present large amounts of data in quantitative summaries that are meaningful, yet simple to understand. Although many users turn to Excel to perform statistical operations, running descriptive statistics in Access is often the smartest option due to the structure and volume of the data that is to be analyzed.

The simplest descriptive statistics can be generated using aggregate queries (sum, average, min, max, etc.), while more advanced descriptive analyses can be performed by leveraging the power of subqueries and domain aggregate functions. Indeed, using the tools and techniques you have learned thus far, you can create a wide array of descriptive analyses; from determining rank, mode, and median, to creating a frequency distribution.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.

Table of Contents for 8. Running Descriptive Statistics in Access

Create new playlist

Sign In

Sign Up

Chapter 8. Running Descriptive Statistics in Access

Tip

Basic Descriptive Statistics

Running Descriptive Statistics with Aggregate Queries

Determining Rank, Mode, and Median

Ranking the Records in Your Dataset

Note

Tip

Getting the Mode of a Dataset

Note

Getting the Median of a Dataset

Tip

Pulling a Random Sampling from Your Dataset

Note

Warning

Advanced Descriptive Statistics

Calculating Percentile Ranking

Determining the Quartile Standing of a Record

Note

Creating a Frequency Distribution

Summary

Table of Contents for
8. Running Descriptive Statistics in Access