4

Why Is Market Basket Analysis Difficult?

Charades is a game in which a single person tries to communicate with a group of people, usually three to five, without using any words. That person is assigned a secret word or phrase. The goal of the game is to communicate that word or phrase to the group until someone in the group correctly guesses it. What can that person use to communicate? Basically, the person can use his or her body, and that’s about it. A common method is to tug at an ear, which means “sounds like...,” and then act out a word that is different from the assigned word but sounds like it. Sometimes that is necessary because the word or phrase is very difficult to act out. For instance, nuclear holocaust, World Series, and Antietam are much more difficult to communicate than dog, cat, or run.

Imagine a game of charades that has one person trying to communicate hamburger meat goes with hamburger buns, and then another game of charades that has two million people trying to communicate hamburger meat goes with hamburger buns. It’s not going to look like an Olympic synchronized swimming team. No, in fact it’s going to look like two million people waving their arms and legs and tugging at their ears. It’s going to look like chaos on a grand scale. Then, imagine that some of them are really trying to communicate hamburger meat goes with hamburger buns but without cheese, and others are trying to communicate hamburger meat goes with hamburger buns and blue cheese, and still others are trying to communicate tofu meat goes with hamburger buns. Then, imagine that they aren’t trying to communicate with you at all. No, they are just going about living their lives and you are trying to interpret their actions, hoping to discern some sort of consistent pattern to their actions. Add in the people who were just buying hamburger meat, the people who were just buying hamburger buns, and the people who are going to come back after they get home and realize they bought hamburger meat rather than turkey meat and you’re close to appreciating the difficulties of Market Basket Analysis.

Noise

The first and most noticeable difficulty with Market Basket Analysis is the noise—not the kind of noise you hear, but rather the kind of noise you query. When looking for an affinity between hamburger buns and something else, you find that hamburger buns correlate with a thousand other objects, each with its own affinity. At a different time of day, hamburger buns correlate with a slightly different, though overlapping, set of a thousand objects. On the weekend, however, all that changes and hamburger buns correlate with only a few hundred objects.

The chaos in all that noise is neither chaos nor noise. Rather, it is the presence of layers upon layers of various patterns. The data prepared for a college class in statistics or forecasting presents an intended pattern. The student is able to apply a statistical method recently learned in class to find the single pattern that pervades the data. The data from your enterprise has not been prepared by a college professor and does not demonstrate a single pattern. No, the data from a real enterprise has multiple patterns. You will probably have to discover the patterns one at a time.

Some of the data will have no pattern at all. The lack of a pattern is a manifestation of randomness. Randomness is an essential element of any statistical model. As a statistical model explains more of the pattern in the data, the remaining random variance decreases to an acceptable level. No statistical model will ever explain 100% of the data, and Market Basket Analysis is no exception. Therefore, while Market Basket Analysis can find objects that have a strong positive affinity for each other, it cannot identify objects that correlate at 100%, because no such correlation exists. Objects that correlate near 100% (i.e., objects that almost always occur together in the same Itemset) would be considered complementary objects because they seem to complement each other. Objects that correlate near 0% would be considered substitutes because they seem never to occur simultaneously in an Itemset. But even in the case of substitutes, randomness indicates that substitute objects will occasionally occur simultaneously in the same Itemset. An object that correlates equally with all other objects is considered an independent object. An independent object will correlate a little higher or lower with a specific group, during a specific time of day, or through a specific channel on one day, and then correlate a little lower or higher on another day. In that way, randomness indicates that the lack of correlation in an independent object will fluctuate. In general, randomness will cause the correlation, be it positive, negative, or neutral, between two objects to fluctuate.

Conversely, fluctuations in correlation may not be randomness. Rather, fluctuations in correlation may be the presence of multiple patterns overlaid on each other. For most retail stores the two largest patterns are period of the week (i.e., weekday and weekend) and period of the day (i.e., morning before the workday begins, daytime during school, after school and before the workday ends, and after the workday ends). A wonderful example of such an apparent fluctuation dates back to the days before cable television and digital video recorders. A municipal water works department noticed unusual fluctuations in its water pressure. After significant investigation, the department found the source of the fluctuation in water pressure. When a popular sporting event or movie was broadcast on network television, residents of that city were delaying a trip to the bathroom so they could see the end of the television show. The fluctuation appeared first as a spike in water pressure as fewer residents were consuming water and then as a drop in water pressure as an unusually large number of residents ran to the bathroom and flushed their toilets. Eventually, that water works department came to rate the popularity of television shows by the drop in water pressure that occurred at the conclusion of each show. In this example, two patterns overlap the normal consumption of water. The first overlapping pattern is the tendency to wait until a television show is over to go to the bathroom. The second overlapping pattern is the popularity of the television show, which influences the number of residents willing to delay their trip to the bathroom. In this way, fluctuations in correlation may not be randomness but rather may be the presence of multiple simultaneous patterns not yet discovered.

Randomness is not actionable. The goal of Market Basket Analysis is to produce actionable information. Significant unexplained fluctuations in correlation between two objects indicate randomness that can be neither leveraged nor avoided. If the enterprise cannot forecast the correlation between two objects, then the enterprise is left to the whims of chance. For this reason, randomness is an indicator of a future area of investigation. Only when the overlapping patterns of correlation are each identified can they become actionable information for the enterprise.

Large Data Volumes

The best way to reduce the effect of truly random fluctuations in correlation is by incorporating large volumes of data. In statistical analysis this is the concept of statistical significance. In general, a larger sample set generates a result with higher statistical significance and a smaller sample set generates a result with lower statistical significance. Higher statistical significance occurs as an increasing portion of the randomness in a sample set is explained. This does not mean that a large sample set will by its own power and volume explain away all randomness. If that were the case, Market Basket Analysis would be reduced to an exercise in data gathering, rather than data analysis. Instead, larger data volumes (i.e., large sample sets) simply increase the scope and perspective of the world as viewed through the data, like seeing the forest rather than the trees.

To illustrate the effect of sample size on the conclusions drawn during Market Basket Analysis, consider an investigation into the correlation between two objects—milk and cookies. Recognizing the wonderful experience of dunking a cookie into a glass of milk and then eating that cookie, you would expect milk and cookies to have a high affinity for each other. In a sample set of two Itemsets you find the following:

  • Milk and no cookies

  • Cookies and no milk

In this abbreviated sample set, milk and cookies seem to be substitutes, as they never occur simultaneously in the same Itemset. However, recognizing the brevity of the sample set, you include another Itemset. Now the sample set includes the following three Itemsets:

  • Milk and no cookies

  • Cookies and no milk

  • Milk and cookies

A conclusion drawn from this second sample set would indicate that two-thirds of the time milk and cookies are substitutes, and one-third of the time milk and cookies are complementary. Not until you increase the number of Itemsets in the sample set does the picture become clearer:

  • Milk and no cookies

  • Cookies and no milk

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

  • Milk and cookies

The increased sample set allows the large number of milk and cookies Itemsets to demonstrate the true affinity between milk and cookies. In this way, the “milk and no cookies” and “cookies and no milk” Itemsets decrease in significance as the volume of data grows. Interestingly, they also prevent milk and cookies from ever achieving a perfect affinity, that is, 100% correlation.

The same illustration can be drawn for two objects that are substitutes. For example, fifteen Itemsets wherein the two objects occur simultaneously would seem to indicate that those two objects are complements. But then another fifteen thousand Itemsets wherein the two objects occur exclusively of each other indicate that the two objects are actually substitutes. Likewise, an independent object, when investigated myopically, may seem to have an affinity for a specific object. Only when the scope of that investigation is expanded do you realize that the affinity you discovered was between the time of day and the second object. The first object was maintaining its independence while the second object was increasing its occurrence during that time of day. For the reasons shown in these examples, a large sample set can be expected to generate conclusions that are more significant than those generated by a small sample set.
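To make the milk and cookies tally concrete, the sketch below counts co-occurrence in SQL. This is a minimal illustration, not the solution design from Chapter 5; the itemset_line table, holding one row per object per Itemset, is a hypothetical layout assumed here for demonstration.

  -- Hypothetical layout: one row per object per Itemset (basket).
  CREATE TABLE itemset_line (
      itemset_id INTEGER     NOT NULL,
      item       VARCHAR(20) NOT NULL
  );

  -- The fifteen-Itemset milk and cookies sample set, abbreviated.
  INSERT INTO itemset_line VALUES
      (1, 'milk'),                    -- milk and no cookies
      (2, 'cookies'),                 -- cookies and no milk
      (3, 'milk'), (3, 'cookies'),    -- milk and cookies
      (4, 'milk'), (4, 'cookies');    -- ...and so on through Itemset 15

  -- Itemsets containing milk and Itemsets containing cookies.
  SELECT
      COUNT(DISTINCT CASE WHEN item = 'milk'    THEN itemset_id END) AS milk_itemsets,
      COUNT(DISTINCT CASE WHEN item = 'cookies' THEN itemset_id END) AS cookie_itemsets
  FROM itemset_line;

  -- Itemsets containing both milk and cookies.
  SELECT COUNT(*) AS both_itemsets
  FROM (
      SELECT itemset_id
      FROM itemset_line
      WHERE item IN ('milk', 'cookies')
      GROUP BY itemset_id
      HAVING COUNT(DISTINCT item) = 2
  ) AS both_sets;

Loaded with all fifteen Itemsets, the final query would return thirteen, and the two lone-object Itemsets would hold the co-occurrence rate just below 100%, exactly as described above.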

The data gathered for Market Basket Analysis can be “large” in multiple ways. These manifestations of a large sample set include the following:

  • Time—The span of time included in a sample set. A sample set that spans fifteen years is larger than a sample set that spans fifteen days.

  • Groups—The number of enterprise regions, divisions, channels, and so on, included in a sample set. A sample set that spans all areas of the enterprise is larger than a sample set that spans only the mid-Atlantic region or the aeronautics division.

  • Completeness—The portion of the universe of Itemsets represented in a sample set. A sample set that includes the entire universe of Itemsets is larger than a sample set that includes half the universe of Itemsets.

These are the three dimensions along which a sample set can increase in scope. Each of them has its own distinct impact on the significance of the conclusions drawn from the Market Basket Analysis.

Time

Time in a Market Basket Analysis refers to the span of hours, days, weeks, months, and years included in the sample set. If a sample set includes only one day, then all the conclusions drawn from that sample set carry the implicit understanding that all days are like that day. If a sample set includes a week of days, then all conclusions drawn from that sample set carry the implicit understanding that all Mondays are like that Monday, all Tuesdays are like that Tuesday, and so on. Some patterns follow the cycle of a week, wherein the weekdays and the weekend have patterns distinct from each other. Expanding a sample set to include a full month, a set of contiguous months, or a full calendar year increases the time-based patterns that can be discovered. The discovery of seasonal patterns requires multiple contiguous years of data. The data points in contiguous years allow the analysis to compare the months of January, the Fourths of July, the Twenty-fifths of December, and any other periodic pattern that repeats each year.

Time-based patterns extend beyond specific dates (e.g., July 4, December 25). The attributes of dates hold their own set of patterns. The date of August 27 can be a weekday one year and fall on the weekend the next year. That would make one August 27 different from another August 27, even though both are August 27. Patterns tied to weekdays, weekends, and holidays are based on the attributes of dates rather than the dates themselves. Time and the attributes of time, therefore, contribute to the scope and perspective of Market Basket Analysis.
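As a hedged sketch of analyzing a date attribute rather than the date itself, the PostgreSQL-style query below classifies each Itemset as weekday or weekend. The itemset_header table and its columns are assumptions for illustration only.

  -- Hypothetical header table: one row per Itemset.
  CREATE TABLE itemset_header (
      itemset_id       INTEGER NOT NULL,
      transaction_date DATE    NOT NULL,
      country_code     CHAR(2) NOT NULL
  );

  -- Stratify Itemset counts by a date attribute, not the date.
  -- EXTRACT(DOW FROM ...) is PostgreSQL syntax: 0 = Sunday, 6 = Saturday.
  SELECT
      CASE WHEN EXTRACT(DOW FROM transaction_date) IN (0, 6)
           THEN 'weekend'
           ELSE 'weekday'
      END AS period_of_week,
      COUNT(*) AS itemset_count
  FROM itemset_header
  GROUP BY 1;

Grouping by the derived attribute, rather than by transaction_date, is what lets one August 27 differ from another.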

Groups

An enterprise expansive enough to have a data warehouse probably has multiple groups within itself. Those groups can be geographic (e.g., north, south, east, west, north-central), operational (e.g., distribution routes, broadcast areas, contracted affiliates), demographic (e.g., urban and inner city, rural and agricultural, DMA), governmental (e.g., city limits, state lines and national boundaries), or lines of business (e.g., retail sales, service and repair, online sales). The groupings of the enterprise can influence the data available to, and included in, the Market Basket Analysis. The groups themselves are attributes of the data they present to the analysis and should stay connected to the data during analysis.

Combining all groups of the enterprise in the data available to the analysis should enhance the statistical significance of the conclusions drawn during analysis. For the reasons mentioned above in the milk and cookies example, a large set of data obscures the outliers as it highlights patterns in the data. Unfortunately, combining all groups of the enterprise can have the reverse effect. A holiday in Norway may not be a holiday in France and vice versa. So, the attributes associated with the groups of the enterprise may be needed to stratify the data, separating Norway from France. By including data from all groups within the enterprise, along with the attributes of those groups, the analysis can leverage large data volumes to increase the statistical significance of its conclusions and simultaneously identify patterns that are distinct to each group.
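Reusing the hypothetical itemset_header and itemset_line tables from the earlier sketches, one way to keep a group attribute connected to the data is simply to carry it into the aggregation:

  -- Count the Itemsets containing milk, stratified by country, so the
  -- group attribute stays connected to the data during analysis.
  SELECT
      h.country_code,
      COUNT(DISTINCT l.itemset_id) AS milk_itemsets
  FROM itemset_header h
  JOIN itemset_line   l ON l.itemset_id = h.itemset_id
  WHERE l.item = 'milk'
  GROUP BY h.country_code;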

Completeness and Data Sampling

Television news programs in the United States report election results on the night of an election based on a sample of the voting precincts, possibly as few as 1% of the total precincts. A sample set that small can render the predicted outcome unreliable. The most famous such occurrence is the 1948 presidential election, wherein the Chicago Tribune printed and distributed newspapers declaring Thomas E. Dewey the winner of the presidential election. The irony was that Harry S. Truman won that presidential election. Apparently, the data sample used by the Chicago Tribune was a bit small, premature, and eventually embarrassing.

The cost of gathering data is the reason for data sampling. Statistics includes the standard practice of data sampling because statistical practices are best employed where the 80/20 rule applies. The 80/20 rule tells us that 80% of the value can be achieved with 20% of the cost, and that the remaining 20% of the value can be achieved with 80% of the cost. Statistics also includes the concept of statistical significance to help a statistician monitor the value of the statistical results relative to the cost of gathering data. When the cost of measuring an entire universe of data values (e.g., chemical content of the grains of sand on the beach, number of fish in the ocean, chemical parts per million of the water in the Mississippi River) is prohibitively high, a sample of that universe of values is statistically extrapolated to represent the entire universe of values.

Contrast that with a universe of values that is relatively low in cost to gather. For example, a schoolteacher measures the academic progress of all the students in a class, rather than just the front row, by administering a test to all the students. In that case, the value derived from sampling is extremely low because the cost of measuring the entire universe of values is already low. So, the teacher measures the academic progress of all the students rather than a sample of the students.

For Market Basket Analysis the same concepts apply. A data sample, rather than the whole universe of data, is used when the 80/20 rule indicates that 80% of the value of the Market Basket Analysis can be achieved with 20% of the cost. In other words, when the cost of including all the data is prohibitively high, and the value of the analysis can still be delivered with a sample of the data, then a sample set of data can be used in the Market Basket Analysis. In most Market Basket Analysis applications, the data to be analyzed is internal to the enterprise. The data, therefore, is available to the data warehouse and to the Market Basket Analysis. In that scenario, the majority of the cost is the disk storage and CPU throughput to process the data. For an enterprise with ten years of data available in electronic form, the whole universe of data may reasonably be considered too costly to be included in the analysis.

Like any statistical analysis, the data provided to the Market Basket Analysis application may be only a sample of the whole universe of data. If the sample is large enough to provide statistically significant results, the use of data sampling can provide the benefits of analysis without excessive overhead. In the iterative world of Market Basket Analysis wherein an iteration of analysis will lead to another iteration of analysis which will lead to yet another iteration of analysis, the delivery of value without excessive overhead in each iteration can quickly become a key to the success of a Market Basket Analysis application. In that way, a complete set of data can be good but not always best, and a sample set of data can be incomplete but not always worst.

Data Sample Integrity

The Achilles’ heel of data sampling is the integrity of the data sample. A data sample can represent a universe of values only when the data sample is distributed without bias across the universe from which it is drawn. If a data sample is skewed such that it misrepresents the universe of values, the results drawn from that data sample will be equally skewed. The creation of a data sample, therefore, falls under the axiom of all information systems—“garbage in...garbage out.”

In the event a complete universe of data is not a feasible option, utmost care must be taken to ensure that the sample data provided to the analysis has no bias or skew and truly represents the universe of data values. Statistical practices include random sampling methods, which are designed to produce a sample free of systematic bias. While it may be tempting, and even seem reasonable, to use a sample set of data that comes from a group or section of the enterprise already understood by the analyst, or at least handy and already available, sampling by such convenient methods will deliver skewed and questionable results at best. If a data sample must be used rather than the complete universe of values, statistical random sampling is the best way to generate that sample set.
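If sampling is unavoidable, a minimal sketch of random sampling in SQL might look like the following. The random() function and LIMIT clause are PostgreSQL/SQLite idioms, the sample size of 1,000 is arbitrary, and itemset_line is the hypothetical layout used earlier.

  -- Draw a random sample of whole Itemsets, then pull every line of the
  -- sampled Itemsets. Sampling whole Itemsets, rather than individual
  -- lines, keeps each basket intact.
  SELECT l.*
  FROM itemset_line l
  JOIN (
      SELECT itemset_id
      FROM (SELECT DISTINCT itemset_id FROM itemset_line) AS all_itemsets
      ORDER BY random()
      LIMIT 1000            -- sample size chosen for illustration
  ) AS sampled ON sampled.itemset_id = l.itemset_id;

Sampling at the Itemset level, rather than the line level, is the design choice that preserves sample integrity: a basket that arrives with half its objects missing misrepresents the universe of values.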

Data Warehouse Data Structures

The tables and views in a data warehouse have been optimized for query performance. The queries for which they have been optimized are the BI Report queries, which occur frequently and can generate a high cost to the data warehouse when not optimized. You already know that the tables in your data warehouse have not been optimized for Market Basket Analysis; otherwise you would not be reading this book right now. That lack of optimization is caused by aspects peculiar to Market Basket Analysis that do not occur in normal BI Reporting. Those aspects include the flexibility of the Itemset, the lack of control, and the recursive nature of Market Basket Analysis.

Flexibility of the Itemset

An Itemset can hold an almost infinite number of combinations of objects. The minimum number of objects in an Itemset is one; without at least that first object, the Itemset does not exist. The maximum number of objects in an Itemset is limited only by the application that created the data and the application that stores the data. If the application that created the data has a limit of two thousand items in a transaction, then the maximum number of objects in an Itemset is that limit—two thousand. As such, an Itemset is an array. The procedural logic of a stored procedure, or of a COBOL program with its OCCURS clause, works well with an array. Procedural logic is able to read through the array, cataloging each object until it reads the end of the array. While reading through the array, procedural logic can keep a log of the other objects in the array and a tally of the number of times those other objects occurred. Then, procedural logic can repeat that process beginning with the second object in the array; the third time through the array can begin with the third object. The algorithm for reading through an array is to let the nth object be the Driver Object in the nth iteration through the array and let all other objects be Correlation Objects. In this way, procedural logic can read through the objects in an Itemset by treating the Itemset as an array.

Set logic, the basis of relational SQL, lacks the flexibility of procedural logic because set logic is based on a recurring set of rows of data. This works well if the rows of data fit in the same set definition. The flexibility of an Itemset, however, causes the data of an Itemset to not fit in a single set definition, as one row will have data in only one column while another row will have data in two hundred columns. Therefore, the set logic inherent in SQL does not work well due to the flexible number of objects in each Itemset.

The Market Basket Analysis solution design in Chapter 5 uses relational tables and SQL. However, the solution design in Chapter 5 breaks the array of an Itemset into a defined pattern of a Driver Object and a Correlation Object. That way, the solution design does not need to know how many objects are in the array. In the solution design the array is broken into pairs of objects and then the array completely disappears. In this way, the flexibility of the array of an Itemset is replaced by a predetermined definition of pairs of objects. Once the array of an Itemset is converted into a set of rows of data, each containing a pair of objects, the advantages of set logic and relational SQL are leveraged by the most basic of SQL constructs—SUM and GROUP BY.
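As a rough sketch of the idea (not the actual Chapter 5 design), a self-join can convert the hypothetical itemset_line layout used earlier into Driver/Correlation pairs:

  -- Break each Itemset into Driver Object / Correlation Object pairs.
  -- DISTINCT collapses repeated occurrences of the same object within
  -- one Itemset into a single pair.
  CREATE VIEW driver_correlation_pair AS
  SELECT DISTINCT
      a.itemset_id,
      a.item AS driver_object,
      b.item AS correlation_object
  FROM itemset_line a
  JOIN itemset_line b
    ON  b.itemset_id = a.itemset_id
    AND b.item       <> a.item;

Every row of the view has exactly two objects, no matter how many objects the original Itemset held, which is what makes set logic applicable.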

Lack of Control

The flexibility of an Itemset is an issue for Market Basket Analysis because the analyst cannot control the number of objects in an Itemset. An analyst cannot ask everyone who interacts with the enterprise to please limit their activity to five objects, no more, no less. On the contrary, the people who interact with the enterprise, each of them generating an Itemset, are simply going about their business with no thought of the Market Basket Analysis about to be performed on the data they are generating. So, an analyst seldom has any reliable way to influence the number of objects included in an Itemset.

Any attempt to force an Itemset to fit into a predefined Itemset definition will modify the data so significantly as to render any conclusions fallacious. The only way to discern the patterns of the enterprise is to first let those patterns happen, to release control of the Itemsets to the people who, by their interaction with the enterprise, generate the Itemsets of the enterprise.

Only after the Itemsets are allowed to occur naturally can those Itemsets be studied in earnest. The solution design in Chapter 5 is able to handle this lack of control because it first converts the Itemset with an undefined number of objects into a predefined set of rows. Each row contains a pair of objects. One object is the Driver Object. The other object is the Correlation Object. Because the relational SQL logic knows the recurring format of the rows of data, each containing two objects, the set logic of relational SQL is able to handle an undefined number of rows, each in a predefined format.

Recursive Nature of Market Basket Analysis

When studying an Itemset to discern the patterns surrounding the occurrence of Item X, the analyst does not know where in that Itemset Item X occurs. Unfortunately, the enterprise was not able to ask all the people who interacted with the enterprise to please use Item X first. Then, when the analyst is ready to study Item Y, the enterprise is not able to ask everyone to please come back and repeat their Itemset, but this time beginning with Item Y. So, again the analyst has no idea where in all the Itemsets the focal Item Y will occur. Then, when the analyst is ready to study Item Z...well, the problem continues.

In this way, Market Basket Analysis would pick up the first object in an Itemset and consider all the other objects in the context of that first object, but only if that first object is the focal object. If not, then the Market Basket Analysis would skip the first object, pick up the second object, and consider all the other objects in the context of that second object, but only if that second object is the focal object. This process continues for every object in the Itemset, even if the focal object is found somewhere in the middle of the Itemset. The search for the focal object continues through the entire Itemset because the person generating the Itemset may have chosen to include an object twice. Logically, then, Market Basket Analysis juxtaposes each object, one at a time in its own turn, against all the other objects in the Itemset. This is a recursion of the Itemset, juxtaposing each nth object in the Itemset with all other objects in the Itemset. Note that an individual object may occur multiple times, and not necessarily in contiguous positions, within a single Itemset. The logic that juxtaposes a Driver Object with all Correlation Objects must be able to resolve the multiple occurrences of an object into a single occurrence of that object.

As mentioned previously, procedural logic in a stored procedure is able to perform this recursion by considering one Driver Object at a time. In each pass through the Itemset, procedural logic can keep a tally of the Correlation Objects that occurred in the context of that Driver Object. Once a pass is complete, procedural logic can move on to the next Driver Object, repeating until it reaches the end of the Itemset.

The solution design in Chapter 5 is able to accomplish the recursion between a Driver Object and all the other objects in an Itemset by first capturing each recursion as a row of data. Each row of data juxtaposes a Driver Object with a single Correlation Object. Then, the set logic of relational SQL is able to use a SUM and GROUP BY statement to count the number of occurrences of each Correlation Object that occurred in an Itemset within the context of a Driver Object. The same relational SQL is also able to sum any metrics or measurements (e.g., quantity, currency, points) associated with each Correlation Object.
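A minimal sketch of that aggregation, built on the driver_correlation_pair view from the earlier sketch, shows the SUM and GROUP BY pattern at work:

  -- For each Driver Object, count the Itemsets in which each Correlation
  -- Object occurred. A SUM over quantity or currency columns would sit
  -- alongside the COUNT if such metrics were carried on the pair rows.
  SELECT
      driver_object,
      correlation_object,
      COUNT(*) AS itemset_count
  FROM driver_correlation_pair
  GROUP BY driver_object, correlation_object
  ORDER BY driver_object, itemset_count DESC;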

On Your Mark...Get Set...Go!

The solution design in Chapter 5 cannot change the laws of data sampling. Unfortunately, the need to maintain the integrity of the data sample and simultaneously achieve statistical significance cannot be removed. However, the large data volumes necessary to deliver statistically significant and actionable information are made feasible by the solution design in Chapter 5. Rather than become mired in recursions of arrays containing an undefined number of members, the solution design first converts the Itemset array into a set of rows. From that point on, the Market Basket Analysis application is able to achieve its results by summing the metrics and measurements in the columns and rows of that table. So, unless there are any questions before moving on...it’s on to the solution design for Market Basket Analysis.
