Association rule learning

Association rule learning has been a popular approach to discover interesting relationships between items in large databases. It is most commonly applied in retail to reveal regularities between products.

Association rule learning approaches find patterns in the form of interesting, strong rules in the database, using different measures of interestingness. For example, the following rule would indicate that if a customer buys onions and potatoes together, they are likely to also buy hamburger meat: {onions, potatoes} -> {burger}

Another classic story, probably told in every machine learning class, is the beer and diapers story. An analysis of supermarket shoppers' behavior showed that customers, presumably young men, who buy diapers tend also to buy beer. It immediately became a popular example of how an unexpected association rule might be found in everyday data; however, there are varying opinions as to how much of the story is true. Daniel Power says (DSS News, 2002):

"In 1992, Thomas Blischok, manager of a retail consulting group at Teradata, and his staff prepared an analysis of 1.2 million market baskets from about 25 Osco Drug stores. Database queries were developed to identify affinities. The analysis "did discover that between 5:00 and 7:00 p.m. consumers bought beer and diapers". Osco managers did NOT exploit the beer and diapers relationship by moving the products closer together on the shelves."

In addition to the preceding example from market basket analysis (MBA), association rules are today employed in many application areas, including web usage mining, intrusion detection, continuous production, and bioinformatics. We'll take a closer look at these areas later in this chapter.

Basic concepts

Before we dive into algorithms, let's first review the basic concepts.

Database of transactions

In association rule mining, the dataset is structured a bit differently from the approach presented in the first chapter. First, there is no class value, as this is not required for learning association rules. Next, the dataset is presented as a transactional table, where each supermarket item corresponds to a binary attribute. Hence, the feature vector can be extremely large.

Consider the following example. Suppose we have four receipts as shown in the following image. Each receipt corresponds to a purchasing transaction:

[Image: Database of transactions]

To write these receipts in the form of a transactional database, we first identify all the possible items that appear in the receipts. These items are onions, potatoes, burger, beer, and diapers. Each purchase, that is, transaction, is presented in a row, with 1 if an item was purchased within the transaction and 0 otherwise, as shown in the following table:

Transaction ID | Onions | Potatoes | Burger | Beer | Diapers
---------------|--------|----------|--------|------|--------
1              | 0      | 1        | 1      | 0    | 0
2              | 1      | 1        | 1      | 1    | 0
3              | 0      | 0        | 0      | 1    | 1
4              | 1      | 0        | 1      | 1    | 0

This example is really small. In practical applications, the dataset often contains thousands or millions of transactions, which allows the learning algorithm to discover statistically significant patterns.
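To make this concrete, here is a minimal sketch of how the receipts could be encoded as a transactional database in Python; the variable names are illustrative, not part of any particular library:

```python
# Each receipt is a set of purchased items (transactions 1 to 4 from the
# table above).
transactions = [
    {"potatoes", "burger"},                    # transaction 1
    {"onions", "potatoes", "burger", "beer"},  # transaction 2
    {"beer", "diapers"},                       # transaction 3
    {"onions", "burger", "beer"},              # transaction 4
]

# Identify all possible items, then encode each transaction as a binary row.
items = sorted(set().union(*transactions))
table = [[1 if item in t else 0 for item in items] for t in transactions]

print(items)  # ['beer', 'burger', 'diapers', 'onions', 'potatoes']
for tid, row in enumerate(table, start=1):
    print(tid, row)
```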

Itemset and rule

An itemset is simply a set of items, for example, {onions, potatoes, burger}. A rule consists of two itemsets, X and Y, written in the following format: X -> Y.

Such a rule describes a pattern: when the itemset X is observed, Y tends to be observed as well. To select interesting rules, various measures of significance can be used.

Support

Support of an itemset is defined as the proportion of transactions that contain the itemset. For example, the {potatoes, burger} itemset in the previous table occurs in 50% of transactions (2 out of 4), so its support is supp({potatoes, burger}) = 2/4 = 0.5.

Intuitively, it indicates the share of transactions that support the pattern.
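Continuing the sketch from the previous section, support can be computed directly from the transactions list; this is an illustrative helper, not a library function:

```python
def support(itemset, transactions):
    """Proportion of transactions that contain every item in the itemset."""
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

# Reusing the `transactions` list defined earlier:
print(support({"potatoes", "burger"}, transactions))  # 2/4 = 0.5
```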

Confidence

Confidence of a rule indicates its accuracy. It is defined as:

conf(X -> Y) = supp(X ∪ Y) / supp(X)

For example, the {onions, burger} -> {beer} rule in the previous table has confidence supp({onions, burger, beer}) / supp({onions, burger}) = 0.5/0.5 = 1.0, which means that 100% of the time when onions and burger are bought together, beer is bought as well.
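Confidence then follows directly from the support helper defined above; again, a sketch rather than a reference implementation:

```python
def confidence(lhs, rhs, transactions):
    """conf(X -> Y) = supp(X | Y) / supp(X)."""
    return support(lhs | rhs, transactions) / support(lhs, transactions)

# 100% of transactions containing onions and burger also contain beer.
print(confidence({"onions", "burger"}, {"beer"}, transactions))  # 1.0
```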

Apriori algorithm

The Apriori algorithm is a classic algorithm used for frequent pattern mining and association rule learning over transactional databases. By identifying the frequent individual items in a database and extending them to larger itemsets, Apriori can determine the association rules, which highlight general trends in the database.

The Apriori algorithm constructs a set of itemsets, for example, itemset1 = {Item A, Item B}, and calculates support, which counts the number of occurrences of the itemset in the database. Apriori then uses a bottom-up approach: frequent itemsets are extended one item at a time, and candidates are pruned by first looking at the smaller sets and recognizing that a large set cannot be frequent unless all its subsets are. The algorithm terminates when no further successful extensions are found.
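The following is a compact sketch of this bottom-up idea in plain Python, reusing the transactions list from earlier in this chapter; the function name and structure are illustrative, not a reference implementation:

```python
from itertools import combinations

def apriori(transactions, min_support=0.5):
    """Return every frequent itemset as a dict {frozenset: support}."""
    n = len(transactions)

    def supp(itemset):
        # Proportion of transactions containing the candidate itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent individual items.
    items = {i for t in transactions for i in t}
    frequent = {frozenset([i]) for i in items
                if supp(frozenset([i])) >= min_support}
    result = {s: supp(s) for s in frequent}

    k = 2
    while frequent:
        # Join step: combine frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in frequent for b in frequent
                      if len(a | b) == k}
        # Prune step (the Apriori property): a k-itemset cannot be frequent
        # unless all of its (k-1)-item subsets are frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in frequent
                             for s in combinations(c, k - 1))}
        frequent = {c for c in candidates if supp(c) >= min_support}
        result.update({s: supp(s) for s in frequent})
        k += 1
    return result

# Frequent itemsets with support >= 0.5 in the example database:
for itemset, s in apriori(transactions).items():
    print(set(itemset), s)
```

The prune step is what makes the bottom-up approach pay off: entire branches of the candidate space are discarded without ever scanning the database.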

Although the Apriori algorithm is an important milestone in machine learning, it suffers from a number of inefficiencies and trade-offs. In the following section, we'll look into the more recent FP-growth technique.

FP-growth algorithm

FP-growth, where FP stands for frequent pattern, represents the transaction database as a prefix tree. First, the algorithm counts the occurrences of items in the dataset. In a second pass, it builds a prefix tree, an ordered tree data structure commonly used to store strings. An example of a prefix tree based on the previous example is shown in the following diagram:

[Image: Prefix tree for the example transaction database]

If many transactions share the most frequent items, the prefix tree provides high compression close to the tree root. Large itemsets are grown directly, instead of generating candidate itemsets and testing them against the entire database. Growth starts at the bottom of the tree, by finding all the itemsets matching minimal support and confidence. Once the recursive process has completed, all large itemsets with minimum coverage have been found, and association rule creation begins.

The FP-growth algorithm has several advantages. First, it constructs an FP-tree, which encodes the original dataset in a substantially compact representation. Second, it efficiently builds frequent itemsets, leveraging the FP-tree structure and a divide-and-conquer strategy.
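In practice, FP-growth is rarely implemented from scratch. As an illustration, the following sketch assumes the third-party mlxtend library, which is not referenced in this chapter and would need to be installed separately (for example, with pip install mlxtend); its API may vary between versions:

```python
import pandas as pd
from mlxtend.frequent_patterns import fpgrowth, association_rules

# The same transactional table as before, one-hot encoded.
df = pd.DataFrame(
    [[0, 1, 1, 0, 0],
     [1, 1, 1, 1, 0],
     [0, 0, 0, 1, 1],
     [1, 0, 1, 1, 0]],
    columns=["onions", "potatoes", "burger", "beer", "diapers"],
).astype(bool)

# Mine frequent itemsets with FP-growth, then derive rules from them.
frequent = fpgrowth(df, min_support=0.5, use_colnames=True)
rules = association_rules(frequent, metric="confidence", min_threshold=1.0)
print(rules[["antecedents", "consequents", "support", "confidence"]])
```

With min_support=0.5 and a confidence threshold of 1.0, this recovers rules such as {onions, burger} -> {beer} from the example database.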
