In this chapter, we will learn how to mine frequent patterns, association rules, and correlation rules when working with R programs. Then, we will evaluate all these methods with benchmark data to determine the interestingness of the frequent patterns and rules. We will cover the following topics in this chapter:
The algorithms to find frequent items from various data types can be applied to numeric or categorical data. Most of these algorithms have one common basic algorithmic form, which is A-Priori, depending on certain circumstances. Another basic algorithm is FP-Growth, which is similar to A-Priori. Most pattern-related mining algorithms derive from these basic algorithms.
With frequent patterns found as one input, many algorithms are designed to find association and correlation rules. Each algorithm is only a variation from the basic algorithm.
Along with the growth, size, and types of datasets from various domains, new algorithms are designed, such as the multistage algorithm, the multihash algorithm, and the limited-pass algorithm.
One popular task for data mining is to find relations among the source dataset; this is based on searching frequent patterns from various data sources, such as market baskets, graphs, and streams.
All the algorithms illustrated in this chapter are written from scratch in the R language for the purpose of explaining association analysis, and the code will be demonstrated using the standard R packages for the algorithms such as arules.
With many applications across a broad field, frequent pattern mining is often used in solving various problems, such as the market investigation for a shopping mall from the transaction data.
Frequent patterns are the ones that often occur in the source dataset. The dataset types for frequent pattern mining can be itemset, subsequence, or substructure. As a result, the frequent patterns found are known as:
These three frequent patterns will be discussed in detail in the upcoming sections.
These newly founded frequent patterns will serve as an important platform when searching for recurring interesting rules or relationships among the given dataset.
Various patterns are proposed to improve the efficiency of mining on a dataset. Some of them are as follows; they will be defined in detail later:
The frequent itemset originated from true market basket analysis. In a store such as Amazon, there are many orders or transactions; a certain customer performs a transaction where their Amazon shopping cart includes some items. The mass result of all customers' transactions can be used by the storeowner to find out what items are purchased together by customers. As a simple definition, itemset denotes a collection of zero or more items.
We call a transaction a
basket, and a set of items can belong to any basket. We will set the variable s
as the support threshold, which is compared with the count of a certain set of items that appear in all the baskets. If the count of a certain set of items that appear in all the baskets is not less than s
, we would call the itemset a frequent itemset.
An itemset is called a k-itemset if it contains k pieces of items, where k is a non-zero integer. The support count of an itemset is , the count of itemset contained X, given the dataset.
For a predefined minimum support threshold s, the itemset X
is a frequent itemset if . The minimum support threshold s
is a customizable parameter, which can be adjusted by domain experts or experiences.
The frequent itemset is also used in many domains. Some of them are shown in the following table:
Items |
Baskets |
Comments | |
---|---|---|---|
Related concepts |
Words |
Documents | |
Plagiarism |
Documents |
Sentences | |
Biomarkers |
Biomarkers and diseases |
The set of data about a patient |
If an itemset is frequent, then any of its subset must be frequent. This is known as the A-Priori principle, the foundation of the A-Priori algorithm. The direct application of the A-Priori principle is to prune the huge number of frequent itemsets.
One important factor that affects the number of frequent itemsets is the minimum support count: the lower the minimum support count, the larger the number of frequent itemsets.
For the purpose of optimizing the frequent itemset-generation algorithm, some more concepts are proposed:
Y
does not have frequent supersets.
The following example is of a transaction dataset. All itemsets only contain items from the set, .Let's assume that the minimum support count is 3.
tid (transaction id) |
List of items in the itemset or transaction |
---|---|
T001 | |
T002 | |
T003 | |
T004 | |
T005 | |
T006 | |
T007 | |
T008 | |
T009 | |
T010 |
Then, we will get the frequent itemsets and .
The frequent sequence is an ordered list of elements where each element contains at least one event. An example of this is the page-visit sequence on a site by the specific web page the user is on more concretely speaking, the order in which a certain user visits web pages. Here are two examples of the frequent subsequence:
The length of a sequence is defined by the number of items contained in the sequence. A sequence of length k is called a k-sequence. The size of a sequence is defined by the number of itemsets in the sequence. We call a sequence as a subsequence of the sequence or as the super sequence of when is satisfied.
In some domains, the tasks under research can be modeled with a graph theory. As a result, there are requirements for mining common subgraphs (subtrees or sublattices); some examples are as follows:
A graph G is represented by G = (V, E)
, where V
represents a group of vertices, and E
represents a group of edges. A graph is called as subgraph of graph G = (V, E)
once and . Here is an example of a subgraph. There is the original graph with vertices and edges on the left-hand side of the following figure and the subgraph on the right-hand side with some edges omitted (or omission of vertices in other circumstances):
Mining of association rules is based on the frequent patterns found. The different emphases on the interestingness of relations derives two types of relations for further research: association rules and correlation rules.
In a later section, a method to show association analysis is illustrated; this is a useful method to discover interesting relationships within a huge dataset. The relations can be represented in the form of association rules or frequent itemsets.
Association rule mining is to find the result rule set on a given dataset (the transaction data set or other sequence-pattern-type dataset), a predefined minimum support count s, and a predefined confidence c, given any found rule , and .
is an association rule where ; X and Y are disjoint. The interesting thing about this rule is that it is measured by its support and confidence. Support means the frequency in which this rule appears in the dataset, and confidence means the probability of the appearance of Y
when X
is present.
For association rules, the key measures of rule interestingness are rule support and confidence. Their relationship is given as follows:
support_count(X)
is the count of itemset in the dataset, contained X
.
As a convention, in support_count(X)
, in the confidence value and support count value are represented as a percentage between 0 and 100.
The association rule is strong once and . The predefined minimum support threshold is s, and c
is the predefined minimum confidence threshold.
The meaning of the found association rules should be explained with caution, especially when there is not enough to judge whether the rule implies causality. It only shows the co-occurrence of the prefix and postfix of the rule. The following are the different kinds of rules you can come across:
In some situations, the support and confidence pairs are not sufficient to filter uninteresting association rules. In such a case, we will use support count, confidence, and correlations to filter association rules.
There are a lot of methods to calculate the correlation of an association rule, such as analyses, all-confidence analysis, and cosine. For a k-itemset , define the all-confidence value of X as:
3.141.19.212