In this chapter, you will learn how to write mining code for stream data, time-series data, and sequence data.
Stream, time-series, and sequence data share a distinctive characteristic: they are large and potentially endless. Such data is often too large to yield an exact result, so an approximate result is sought instead. Classic data-mining algorithms must be extended, or new algorithms designed, for this type of dataset.
When mining stream, time-series, and sequence data, some topics cannot be avoided: association, frequent patterns, classification, clustering, and so on. In the following sections, we will go through these major topics.
In this chapter, we will cover the following topics:
As we mentioned in the previous chapters, each kind of data source requires either an adapted version of existing algorithms or a brand new algorithm. Streaming data behaves a bit differently from a traditional dataset.
Streaming datasets come from various sources in modern life, such as credit card transaction streams, web feeds, phone-call records, sensor data from satellites or radar, network traffic data, security event streams, and a long list of other data streams.
The goals of stream data processing include, but are not limited to, summarizing the stream to a specified extent.
Given the characteristics of streaming data, the typical architecture of a stream management system is illustrated in the following diagram:
The STREAM algorithm is a classical algorithm used to cluster stream data. In the next section, its details are presented and explained with R code.
The summarized pseudocode of the STREAM algorithm is as follows:
In the preceding algorithm, LOCALSEARCH is a revised k-median algorithm.
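To make the chunk-based idea concrete, the following is a minimal sketch and not the content of ch_08_stream.R. It assumes the usual outline of STREAM: the stream is read in chunks, each chunk is reduced to at most k weighted centers, and the retained centers are then clustered again. A small weighted k-means routine is swapped in for LOCALSEARCH purely for brevity. The names weighted_kmeans and stream_cluster, the chunk size, and the synthetic data are all assumptions made for illustration.

```r
# Weighted Lloyd-style clustering. This is a stand-in for LOCALSEARCH
# (an assumption of this sketch); it is k-means, not a true k-median.
weighted_kmeans <- function(x, w, k, iters = 20) {
  x <- as.matrix(x)
  # Deterministic initialisation: k rows spread evenly over the data.
  centers <- x[round(seq(1, nrow(x), length.out = k)), , drop = FALSE]
  label <- rep(1L, nrow(x))
  for (i in seq_len(iters)) {
    # Squared Euclidean distance from every point to every center.
    d <- matrix(
      sapply(seq_len(k), function(j) rowSums(sweep(x, 2, centers[j, ])^2)),
      nrow = nrow(x)
    )
    label <- max.col(-d, ties.method = "first")  # index of nearest center
    for (j in seq_len(k)) {
      idx <- label == j
      if (any(idx)) {
        # Weighted mean of the points assigned to center j.
        centers[j, ] <- colSums(x[idx, , drop = FALSE] * w[idx]) / sum(w[idx])
      }
    }
  }
  weights <- sapply(seq_len(k), function(j) sum(w[label == j]))
  list(centers = centers, weights = weights)
}

# Hypothetical driver: summarise each chunk, then cluster the summaries.
stream_cluster <- function(x, k, chunk_size) {
  x <- as.matrix(x)
  kept_centers <- NULL
  kept_weights <- numeric(0)
  for (s in seq(1, nrow(x), by = chunk_size)) {
    chunk <- x[s:min(s + chunk_size - 1, nrow(x)), , drop = FALSE]
    fit <- weighted_kmeans(chunk, rep(1, nrow(chunk)), min(k, nrow(chunk)))
    keep <- fit$weights > 0  # drop centers that attracted no points
    kept_centers <- rbind(kept_centers, fit$centers[keep, , drop = FALSE])
    kept_weights <- c(kept_weights, fit$weights[keep])
  }
  # Only the weighted centers are kept in memory for the final pass.
  weighted_kmeans(kept_centers, kept_weights, min(k, nrow(kept_centers)))
}

# Synthetic two-cluster "stream" (made-up data, for illustration only).
set.seed(1)
x <- rbind(matrix(rnorm(400, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(400, mean = 5, sd = 0.3), ncol = 2))
x <- x[sample(nrow(x)), ]
result <- stream_cluster(x, k = 2, chunk_size = 100)
print(result$centers)
print(result$weights)
```

The design choice worth noting is that the raw points of a chunk are discarded once its weighted centers are recorded, which is what keeps memory use bounded on a long stream.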
Please take a look at the R code file ch_08_stream.R from the bundle of R code for the previously mentioned algorithm. The code can be tested with the following command:
> source("ch_08_stream.R")
Day by day, the growing e-commerce market is driving up credit card usage, which, in turn, generates a large number of transaction streams. Fraudulent usage of credit cards happens every day; we want the algorithm to detect this kind of transaction within a very short time, despite the large volume of transactions. Besides this, identifying valuable customers by analyzing the transaction streams is becoming more and more important. However, it is hard to extract valid information quickly enough to meet advertising needs, such as recommending suitable financial services or goods to these customers.
The credit card transaction flow is also a process that generates stream data, so the related stream-mining algorithms can be applied to it with high accuracy.
One application of credit card transaction flow mining is the behavior analysis of consumers. With each transaction record stored, the various expenses or purchases of a certain person can be tracked. By mining such transaction records, we can provide an analysis to credit card holders to help them maintain a balanced budget or meet other financial targets. We can also provide this analysis to the card issuer to create new business, such as funding or loans, or to retailers (the merchants) such as Walmart to help them stock appropriate goods.
Another application of credit card transaction flow mining is fraud detection. It is the most obvious of the many applications. Using this solution, the card issuer or the bank can reduce the rate of successful fraud.
The credit card transaction dataset includes the card owner, place of consumption, date, cost, and so on.
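As a rough illustration of how such records might be scanned one at a time, the sketch below keeps a running mean and variance of the cost per card owner (Welford's online update) and flags a transaction whose cost lies far from that owner's history. This is only one simple heuristic, not the method used later in the chapter. The column names owner, place, date, and cost, the function flag_unusual, the thresholds, and the sample records are all hypothetical.

```r
# Made-up transaction records; the column names are assumptions.
transactions <- data.frame(
  owner = c("A", "A", "A", "B", "A", "B", "A", "B", "A"),
  place = c("NY", "NY", "NJ", "LA", "NY", "LA", "NY", "SF", "Paris"),
  date  = as.Date("2015-01-01") + 0:8,
  cost  = c(20, 25, 22, 300, 24, 310, 21, 305, 500),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one pass over the stream, constant memory per owner.
flag_unusual <- function(tx, threshold = 3, min_seen = 3) {
  stats <- list()                  # per-owner running n, mean, and M2
  flagged <- logical(nrow(tx))
  for (i in seq_len(nrow(tx))) {
    id <- as.character(tx$owner[i])
    amount <- tx$cost[i]
    s <- stats[[id]]
    if (is.null(s)) s <- list(n = 0, mean = 0, m2 = 0)
    # Judge the new record only against what was seen before it.
    if (s$n >= min_seen) {
      sd_so_far <- sqrt(s$m2 / (s$n - 1))
      flagged[i] <- sd_so_far > 0 &&
        abs(amount - s$mean) > threshold * sd_so_far
    }
    # Welford's online update of the mean and sum of squared deviations.
    s$n <- s$n + 1
    delta <- amount - s$mean
    s$mean <- s$mean + delta / s$n
    s$m2 <- s$m2 + delta * (amount - s$mean)
    stats[[id]] <- s
  }
  flagged
}

flagged <- flag_unusual(transactions)
print(transactions[flagged, ])
```

With the sample records above, only the last transaction of owner A (cost 500) is flagged, because the earlier costs for that owner stay between 20 and 25.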