13.1 Introduction

We have discussed three major approaches for stream analytics in Chapters 10 through 12. In Chapter 10, we described our technique for classifying concept-drifting data streams using a novel ensemble classifier, originally presented in [MASU09]. It is a multiple partition, multiple chunk (MPC) ensemble classifier-based data mining technique for classifying concept-drifting data streams. Existing ensemble techniques for classifying concept-drifting data streams follow a single-partition, single-chunk approach in which a single data chunk is used to train a single classifier. In our approach, we train a batch of v classifiers from r consecutive data chunks using v-fold partitioning of the data and build an ensemble of such classifiers. This MPC ensemble technique significantly reduces classification error compared with single-partition, single-chunk ensemble approaches. We have justified the usefulness of our algorithm theoretically, and demonstrated its effectiveness over other state-of-the-art stream classification techniques empirically on synthetic data and real botnet traffic.

In Chapter 11, we described a novel and efficient technique that can automatically detect the emergence of a novel class in the presence of concept drift by quantifying cohesion among unlabeled test instances and separation of the test instances from training instances. Our approach is nonparametric; that is, it does not assume any underlying distribution of the data. Comparison with state-of-the-art stream classification techniques demonstrates the superiority of our approach. That chapter discussed our proposed framework for classifying data streams with an automatic novel class detection mechanism, based on our previous work [MASU09].

In Chapter 12, we discussed building a classification model from a training set containing both unlabeled instances and a small number of labeled instances. The model is built as microclusters using a semisupervised clustering technique, and classification is performed with the k-nearest neighbor algorithm. An ensemble of these models is used to classify the unlabeled data. Empirical evaluation on both synthetic data and real botnet traffic reveals that our approach, using only a small amount of labeled data for training, outperforms state-of-the-art stream classification algorithms that use 20 times more labeled data. That chapter described our proposed solution to the limited labeled training data problem, based on our work in [MASU08].

In this chapter, we compare the three data stream classification approaches described in the previous chapters and give directions for possible extensions to those approaches. The organization of this chapter is as follows. A summary of the three approaches is provided in Section 13.2. Some extensions are discussed in Section 13.3. Summary and directions are provided in Section 13.4.

13.2 Discussion of the Approaches

We summarize each of our proposed approaches under its respective heading.

13.2.1 MPC Ensemble Approach

We have introduced an MPC ensemble method for classifying concept-drifting data streams. The MPC ensemble approach keeps the best L × v classifiers, where each batch of v classifiers is trained with v overlapping partitions of r consecutive data chunks. It is a generalization of previous ensemble approaches that train a single classifier from a single data chunk. By introducing this MPC ensemble, we have reduced error significantly over the single-partition, single-chunk approach. We have proved our claims theoretically, tested the MPC ensemble approach on both synthetic data and real botnet data, and obtained better classification accuracies than other approaches.
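The v-fold construction described above can be sketched as follows. This is a minimal illustration, not the full MPC algorithm: `train_fn` is a placeholder for any base learner, and the error-based selection that retains only the best L × v classifiers is omitted.

```python
from collections import Counter

def vfold_train(chunks, v, train_fn):
    """Train a batch of v classifiers from r consecutive data chunks.
    The pooled data is split into v folds; classifier i is trained on
    all folds except fold i, yielding v overlapping training partitions."""
    data = [inst for chunk in chunks for inst in chunk]
    folds = [data[i::v] for i in range(v)]
    batch = []
    for held_out in range(v):
        train_set = [inst for i, fold in enumerate(folds)
                     if i != held_out for inst in fold]
        batch.append(train_fn(train_set))
    return batch

def ensemble_predict(classifiers, x):
    """Combine the ensemble members' outputs by majority vote."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]
```

Here `train_fn` is assumed to return a callable classifier; in the actual MPC approach, each newly trained batch competes with the existing ensemble so that only the best L × v classifiers survive.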

13.2.2 Classification and Novel Class Detection in Data Streams (ECSMiner)

We have addressed several real-world problems related to data stream classification. We have proposed a solution to the concept-evolution problem, which has been ignored by most of the existing data stream classification techniques. Existing data stream classification techniques assume that the total number of classes in the stream is fixed. Therefore, when a novel class appears in the stream, instances belonging to that class are misclassified by the existing techniques. We show how to detect novel classes automatically even when the classification model is not trained with the novel class instances. Novel class detection becomes more challenging in the presence of concept drift.

Existing novel class detection techniques have limited applicability, since they behave like one-class classifiers: they assume that there is only one normal class, and that all other classes are novel. ECSMiner, in contrast, is applicable to the more realistic scenario in which more than one existing class is present in the stream. Besides, our novel class detection technique is nonparametric; it does not require any specific data distribution, nor does it require the classes to have convex shapes. We have also shown how to effectively classify stream data under different time constraints. ECSMiner outperforms state-of-the-art data stream classification techniques in both classification accuracy and processing speed. We believe that our proposed technique will inspire more research toward solving real-world stream classification problems.
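The cohesion/separation test at the heart of this detection can be illustrated with a simplified, per-instance version of the score. The actual technique computes such a statistic over clusters of filtered outliers; the neighborhood size `q` and the Euclidean metric here are illustrative choices, and the function names are ours.

```python
import math

def mean_qnn_distance(x, points, q):
    """Mean Euclidean distance from x to its q nearest neighbors in points."""
    dists = sorted(math.dist(x, p) for p in points)
    return sum(dists[:q]) / min(q, len(dists))

def novelty_score(x, outlier_peers, training_points, q=3):
    """Positive when x is more cohesive with its fellow unlabeled outliers
    than it is close to any training data -- evidence of a novel class.
    Negative when x lies nearer the existing (training) classes."""
    d_out = mean_qnn_distance(x, outlier_peers, q)      # cohesion
    d_train = mean_qnn_distance(x, training_points, q)  # separation
    return (d_train - d_out) / max(d_train, d_out)
```

Note that the score is nonparametric in exactly the sense claimed above: it relies only on neighbor distances, with no assumption about the data distribution or the shape of the classes.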

It might appear to readers that in order to detect novel classes, we are simply checking whether new clusters are being formed, so the detection process could proceed without supervision. But supervision is necessary for classification. Without external supervision, two separate clusters belonging to the same class could be regarded as two different classes. Conversely, if more than one novel class appears simultaneously, they could all be regarded as a single novel class if the labels of those instances are never revealed.

13.2.3 Classification with Scarcely Labeled Data (ReaSC)

We address a more realistic problem of stream mining: training with a limited amount of labeled data. ReaSC is a more practical approach to stream classification since it requires a smaller amount of labeled data, saving much of the time and cost that would otherwise be required to manually label the data. Previous approaches to stream classification did not address this vital problem.

We propose and implement a semisupervised clustering-based stream classification algorithm to solve this limited-labeled-data problem. We show that ReaSC, using far fewer labeled training instances than other stream classification techniques, outperforms those techniques. We evaluate ReaSC on two synthetically generated datasets and two real datasets, and achieve better classification accuracies than state-of-the-art stream classification approaches on all datasets.
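The microcluster-plus-k-NN structure of this model can be sketched as follows. The seeding and refinement shown here are a toy stand-in for the actual constrained clustering (which optimizes a combined dispersion-and-impurity objective); the function names and the one-seed-per-labeled-instance initialization are our illustrative assumptions.

```python
import math
from collections import Counter

def build_model(labeled, unlabeled, iters=5):
    """Toy semisupervised clustering: seed one centroid per labeled
    instance, then refine with k-means-style passes over all points.
    Each resulting microcluster keeps the label of its seed."""
    centroids = [list(x) for x, _ in labeled]
    labels = [y for _, y in labeled]
    points = [x for x, _ in labeled] + list(unlabeled)
    for _ in range(iters):
        groups = [[] for _ in centroids]
        for p in points:
            i = min(range(len(centroids)),
                    key=lambda i: math.dist(p, centroids[i]))
            groups[i].append(p)
        for i, g in enumerate(groups):
            if g:  # recompute centroid as the mean of its members
                centroids[i] = [sum(c) / len(g) for c in zip(*g)]
    return list(zip(centroids, labels))

def classify(model, x, k=3):
    """Classify x by k-nearest-neighbor voting over labeled centroids."""
    nearest = sorted(model, key=lambda cl: math.dist(x, cl[0]))[:k]
    return Counter(lbl for _, lbl in nearest).most_common(1)[0][0]
```

Because only the centroids (not the raw instances) are kept, an ensemble of such compact models is cheap to store and update as the stream progresses.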

13.3 Extensions

Our proposed data stream classification techniques can be extended in various ways. The first and most obvious extension would be to develop a unified framework that integrates all three proposed techniques. Currently, we treat the novel class detection and limited labeled data problems separately. In the unified framework, the classification model should be able to detect novel classes even when there are only a few labeled instances per chunk. It would be interesting to see how the scarcity of labeled data affects the proposed novel class detection technique. All of the following extensions are applicable to this unified framework.

Dynamic Feature Vector: We would like to address the data stream classification problem under a dynamic feature vector. Currently, we assume that the feature vector is fixed. We would like to relax this assumption and provide a more general framework for data stream classification in which the feature vector may change over time. This would be useful for classifying text streams and other similar data streams where new features evolve over time. For example, suppose the data stream is a stream of text documents whose feature vector consists of the list of all words that have appeared so far in the stream. Since the vocabulary grows with the arrival of new documents, the feature vector also grows over time. As a result, each classifier in the ensemble would be built on a different feature vector, and a test instance would likely have a different feature vector from the training instances of a classifier. Classifying test instances and detecting novel classes under these conditions would be a challenging task.
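One simple way to handle this mismatch, sketched below under our own assumptions, is to store each classifier's feature list at training time and project every test instance onto it: features the classifier has never seen are dropped, and features absent from the instance default to zero. More sophisticated schemes are certainly possible; the feature names here are purely illustrative.

```python
def project(instance, feature_list):
    """Map a sparse instance (feature name -> value) onto a classifier's
    fixed feature list. Features unknown to the classifier are dropped;
    features absent from the instance default to 0.0."""
    return [instance.get(f, 0.0) for f in feature_list]

# Two ensemble members trained at different times saw different vocabularies.
old_features = ['spam', 'free']
new_features = ['spam', 'free', 'crypto']   # vocabulary grew over time

# A document arriving after both were trained is projected differently
# for each classifier, so both can still vote on it.
doc = {'free': 2.0, 'crypto': 1.0}
```

With this projection, every ensemble member receives a vector in its own feature space, so the older classifiers remain usable even as the vocabulary expands.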

Multilabel Instances: In some classification problems, an instance may have multiple class labels; such instances are known as multilabel instances. The multilabel classification problem is a generalization of the traditional classification problem, where each instance belongs to only one class. An example of multilabel data is the NASA Aviation Safety Reporting Systems dataset [NASA], which was discussed in detail in Section 12.5. Although there are many existing solutions to the multilabel classification problem ([CHEN07], [CLAR01], [GAOS04], [LUO05], [MCDO05], [STRE08], [THAB04], [VELO07], [ZHAN07]), none of them is applicable to data streams. Moreover, the problem becomes more complicated in the presence of concept drift and novel classes. It would be interesting to investigate the effects of multilabel instances on our proposed stream classification technique and extend it to cope with them.

Cloud Computing and Big Data: With so much streaming data being generated by various applications, we need to develop scalable stream data mining techniques that can handle massive amounts of data. As data streams grow larger and faster, it becomes difficult to mine them with a stand-alone machine and limited storage; in the future, we would need to utilize the computing power and storage capacity of several machines. This necessitates adapting our proposed techniques to a cloud computing infrastructure such as Hadoop ([CHU07], [DEAN08], [HUSA09]). To facilitate classification, the raw data will be distributed among the nodes of the system: each data chunk will be partitioned into a number of blocks, with each block stored on a separate node in the Hadoop Distributed File System, as illustrated in Figure 13.1.


Figure 13.1 Architecture for stream data classification using Hadoop Distributed File System.

For example, without loss of generality, assume that a node can hold at least 128 MB of data on disk and that the block size is 64 MB. If the chunk size is 256 MB, the chunk will be partitioned into four data blocks, which will be distributed across two nodes (without replication). On the other hand, if the chunk size is 64 MB, the whole chunk can be stored on a single node. Each node can then process its portion of a chunk in parallel and independently, which speeds up query processing and classification. Each node will train its own classification model from the raw data stored on that node; after training, the raw data will be discarded. Finally, the classification outputs of the individual nodes must be combined into a global output, for example, by majority voting.
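The arithmetic of this placement, and the vote-based combination of node outputs, can be captured in two small helpers. The 64 MB block size and 128 MB node capacity are simply the figures from the example above, and `place_chunk`/`global_output` are our own illustrative names.

```python
from collections import Counter

def place_chunk(chunk_mb, block_mb=64, node_capacity_mb=128):
    """Partition a data chunk into fixed-size blocks and compute how many
    nodes are needed to hold them without replication."""
    n_blocks = -(-chunk_mb // block_mb)              # ceiling division
    blocks_per_node = node_capacity_mb // block_mb
    n_nodes = -(-n_blocks // blocks_per_node)
    return n_blocks, n_nodes

def global_output(node_predictions):
    """Combine the per-node classification outputs by majority voting."""
    return Counter(node_predictions).most_common(1)[0][0]
```

For the numbers above, `place_chunk(256)` gives four blocks over two nodes, and `place_chunk(64)` gives a single block on a single node, matching the example.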

In addition to using a cloud computing framework, we also need to explore the use of the big data management and analytics (BDMA) technologies discussed in Chapter 7. For example, the streaming data may need to be stored in systems such as HBase and CouchDB. Furthermore, the enhanced Weka techniques discussed in Chapter 7 for handling big data have to be examined for stream mining of big data.

Dynamic Chunk Size and Ensemble Size: In the proposed approach, we use a fixed chunk size and a fixed ensemble size. In the future, we would like to make the system more adaptive to concept drift and concept evolution by adapting both the chunk size and the ensemble size. It has been shown in the past that if the chunk size is increased when concept drift is slow and decreased when it is fast, classification performance can be improved ([KLIN00], [WIDM96]). Also, keeping older classifiers in the ensemble hurts performance when concept drift and concept evolution occur quickly, but improves performance when they are slow. Therefore, dynamically changing the ensemble size can also improve overall classification accuracy. However, the main challenge is to determine whether concept drift and concept evolution are occurring, and at what pace ([GARC06], [DRIE09], [GAMA04], [HIDO08]).
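A simple version of this adaptation rule might look like the sketch below. The doubling/halving policy, the size bounds, and the use of rising error as a proxy for faster drift are all hypothetical choices of ours, not a method from the literature cited above.

```python
def adapt_chunk_size(size, recent_error, prev_error,
                     min_size=500, max_size=4000):
    """Shrink the chunk when drift appears to be speeding up (error rising),
    so the model refreshes faster; grow it when the concept looks stable,
    so each classifier trains on more data."""
    if recent_error > prev_error:
        return max(min_size, size // 2)
    return min(max_size, size * 2)
```

A production version would use a proper drift detector rather than a single error comparison, and could apply the same signal to grow or prune the ensemble.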

Parameter Reduction: We also have several other parameters that need to be tuned for optimal performance, for example, the number of clusters, K, and the minimum neighborhood size, q. In the future, we would like to make these parameters adapt to the dynamic nature of data streams and change their values accordingly.

Real-Time Classification: We would like to extend our system to perform real-time stream classification. Real-time data stream classification is important in many applications, such as intrusion detection systems. We need to optimize both classification and training times for the system to be applicable in a real-time environment. In order to get a fast and accurate classification decision, we need to consider a number of issues. First, an instance must be classified using a fast classification model. Second, the classification model must be updated quickly, and the update should not delay the classification of a test instance. Third, the system should be able to extract features from raw data quickly and present the feature vector to the classification model.

Feature Weighting: We would like to incorporate feature weighting and distance learning in the semisupervised clustering, which should lead to a better classification model. Feature weighting and distance learning have been used in many semisupervised computing tasks in the past ([BASU04], [BASU06], [BILE04]). However, the learning process would be more challenging in a streaming environment in the presence of concept drift and novel classes.

13.4 Summary and Directions

This chapter has examined the three approaches we have developed for data stream classification and discussed directions for further work. In particular, we need to enhance the algorithms by providing greater accuracy and fewer false positives and negatives. Furthermore, we need to enhance the performance of the algorithms to handle massive amounts of data. Towards this end, we believe that a cloud-based implementation is a viable approach for high-performance data analytics. In addition, the BDMA technologies that we discussed in Chapter 7 need to be examined for stream mining for big data.

In Section III of this book, we will discuss an application of stream data analytics, namely, insider threat detection. Some other systems that utilize stream mining with big data analytics, including malware detection systems in the cloud, will be discussed in Section IV.

References

[BASU04]. S. Basu, A. Banerjee, R.J. Mooney, “Active Semi-Supervision for Pairwise Constrained Clustering,” SDM ’04: Proceedings of the 2004 SIAM International Conference on Data Mining, April 22–24, Lake Buena Vista, FL, pp. 333–344, SIAM, 2004.

[BASU06]. S. Basu, M. Bilenko, A. Banerjee, R.J. Mooney, “Probabilistic Semi-Supervised Clustering with Constraints,” Semi-Supervised Learning, O. Chapelle, B. Schoelkopf, A. Zien, editors, MIT Press, Cambridge, MA, 73–102, 2006.

[BILE04]. M. Bilenko, S. Basu, R.J. Mooney, “Integrating Constraints and Metric Learning in Semi-Supervised Clustering,” ICML ’04: Proceedings of the Twenty-First International Conference on Machine Learning, July 4–8, Banff, Canada, pp. 81–88, Morgan Kaufmann, 2004.

[CHEN07]. W. Chen, J. Yan, B. Zhang, Z. Chen, Q. Yang, “Document Transformation for Multi-Label Feature Selection in Text Categorization,” ICDM ’07: Proceedings of the 2007 International Conference on Data Mining, October 28–31, Omaha, NE, pp. 451–456, IEEE Computer Society, 2007.

[CHU07]. C.-T. Chu, S.K. Kim, Y.-A. Lin, Y.Y. Yu, G. Bradski, A.Y. Ng, “Map-Reduce for Machine Learning on Multicore,” NIPS ’07: Proceedings of the 21st Annual Conference on Neural Information Processing Systems, December 3–7, Vancouver, B.C., Canada, pp. 281–288, MIT Press, 2007.

[CLAR01]. A. Clare and R.D. King, “Knowledge Discovery in Multi-Label Phenotype Data,” PKDD ’01: Proceedings of the 5th European Conference on Principles of Data Mining and Knowledge Discovery, September 3–7, Freiburg, Germany, pp. 42–53, Springer-Verlag, 2001.

[DEAN08]. J. Dean and S. Ghemawat, “Mapreduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, 51(1):107–113, January 2008.

[DRIE09]. A. Dries and U. Rückert, “Adaptive Concept Drift Detection,” SDM ’09: Proceedings of the 2009 SIAM International Conference on Data Mining, April 30 to May 2, Sparks, NV, pp. 233–244, SIAM, 2009.

[GAMA04]. J. Gama, P. Medas, G. Castillo, P.P. Rodrigues, “Learning with Drift Detection,” SBIA ’04: Proceedings of the 17th Brazilian Symposium on Artificial Intelligence (SBIA), September 29 to October 1, Sao Luis, Maranhao, Brazil, pp. 286–295, Springer, 2004.

[GAOS04]. S. Gao, W. Wu, C.-H. Lee, “A MFoM Learning Approach to Robust Multiclass Multi-Label Text Categorization,” ICML ’04: Proceedings of the 21st International Conference on Machine Learning, July 4–8, Banff, Canada, pp. 329–336, Morgan Kaufmann, 2004.

[GARC06]. M. Baena-Garcia, J. del Campo-Avila, R. Fidalgo, A. Bifet, R. Gavalda, R. Morales-Bueno, “Early Drift Detection Method,” ECML PKDD 2006 Workshop on Knowledge Discovery from Data Streams, September 18, Berlin, Germany, Springer-Verlag, 2006.

[HIDO08]. S. Hido, T. Ide, H. Kashima, H. Kubo, H. Matsuzawa, “Unsupervised Change Analysis Using Supervised Learning,” Advances in Knowledge Discovery and Data Mining, 148–159, 2008.

[HUSA09]. M.F. Husain, P. Doshi, L. Khan, B. Thuraisingham, “Storage and Retrieval of Large RDF Graph Using Hadoop and Mapreduce,” Technical Report No. UTDCS-40-09, Computer Science Department, University of Texas, Dallas, TX, 2009.

[KLIN00]. R. Klinkenberg and T. Joachims, “Detecting Concept Drift with Support Vector Machines,” ICML ’00: Proceedings of the 17th International Conference on Machine Learning, June 29 to July 2, Stanford University, CA, pp. 487–494, Morgan Kaufmann, 2000.

[LUO05]. X. Luo and A. Nur Zincir-Heywood, “Evaluation of Two Systems on Multi-Class Multi-Label Document Classification,” ISMIS ’05: Proceedings of the 15th International Symposium on Methodologies for Intelligent Systems, Saratoga Springs, New York, May 25–28, pp. 161–169, Springer, 2005.

[MCDO05]. R. Mcdonald, K. Crammer, F. Pereira, “Flexible Text Segmentation with Structured Multilabel Classification,” HLT-EMNLP ’05: Proceedings of the 2005 Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing, October 6–8, Vancouver, B.C., Canada, pp. 987–994, 2005.

[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data,” ICDM ’08: Proceedings of the 2008 International Conference on Data Mining, December 15–19, Pisa, Italy, pp. 929–934, IEEE Computer Society, 2008.

[MASU09]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Integrating Novel Class Detection with Classification for Concept-Drifting Data Streams,” ECML PKDD ’09: Proceedings of the 2009 European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in Databases, Vol. II, September 7–11, Bled, Slovenia, pp. 79–94, Springer-Verlag, 2009.

[NASA]. NASA Aviation Safety Reporting System, http://akama.arc.nasa.gov/ASRSDBOnline/QueryWizard_Begin.aspx.

[STRE08]. A.P. Streich and J.M. Buhmann, “Classification of Multi-Labeled Data: A Generative Approach,” ECML PKDD ’08: Proceedings of the 2008 European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in Databases, Vol. II, September 15–19, Antwerp, Belgium, pp. 390–405, Springer, 2008.

[THAB04]. F.A. Thabtah, P. Cowling, Y. Peng, “MMAC: A New Multi-Class, Multi-Label Associative Classification Approach,” ICDM ’04: Proceedings of the 2004 IEEE International Conference on Data Mining, November 1–4, Brighton, UK, pp. 217–224, IEEE Computer Society, 2004.

[VELO07]. A. Veloso, W. Meira Jr, M. Goncalves, M. Zaki, “Multi-Label Lazy Associative Classification,” ECML PKDD ’07: Proceedings of the 2007 European Conference on Machine Learning and Principles and Practice in Knowledge Discovery in Databases, September 17–21, Warsaw, Poland, pp. 605–612, Springer-Verlag, 2007.

[WIDM96]. G. Widmer and M. Kubat, “Learning in the Presence of Concept Drift and Hidden Contexts,” Machine Learning, 23(1):69–101, 1996.

[ZHAN07]. M.-L. Zhang and Z.-H. Zhou, “Multi-Label Learning by Instance Differentiation,” AAAI-07: Proceedings of the 22nd Conference on Artificial Intelligence, July 22–26, Vancouver, British Columbia, Canada, pp. 669–674, 2007.
