14.1 Introduction

There is a growing consensus within the intelligence community that malicious insiders are perhaps the most potent threats to information assurance in many or most organizations ([BRAC04], [HAMP99], [MATZ04], [SALE11]). One traditional approach to the insider threat detection problem is supervised learning, which builds data classification models from training data. Unfortunately, the training process for supervised learning methods tends to be time-consuming and expensive, and generally requires large amounts of well-balanced training data to be effective. In our experiments, we observe that less than 3% of the data in realistic datasets for this problem is associated with insider threats (the minority class), while over 97% is associated with nonthreats (the majority class). Hence, traditional support vector machines (SVMs) ([CHAN11], [MANE02]) trained on such imbalanced data are likely to perform poorly on test datasets.
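To see why such imbalance is problematic, consider a trivial classifier that labels every record as a nonthreat. The following minimal sketch (the 97/3 split and the use of scikit-learn metrics are illustrative assumptions on our part, not part of the original experiments) shows that this baseline reaches roughly 97% accuracy while detecting no threats at all, which is why plain accuracy is a misleading measure here and why the rare class needs special treatment.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Hypothetical labels mirroring the imbalance reported above:
# 1 = insider threat (minority, ~3%), 0 = nonthreat (majority, ~97%)
y_true = np.array([1] * 30 + [0] * 970)

# A trivial "majority class" classifier that never raises an alarm
y_pred = np.zeros_like(y_true)

print("accuracy:", accuracy_score(y_true, y_pred))      # ~0.97, looks good
print("threat recall:", recall_score(y_true, y_pred))   # 0.0, catches nothing
```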

One-class SVMs (OCSVMs) [MANE02] address the rare-class issue by building a model that considers only normal data (i.e., nonthreat data). During the testing phase, test data is classified as normal or anomalous based on its geometric deviation from the model. However, the approach is only applicable to bounded-length, static datasets. In contrast, insider threat-related data is typically continuous and threat patterns evolve over time; in other words, the data is a stream of unbounded length. Hence, effective classification models must be adaptive (i.e., able to cope with evolving concepts) and highly efficient, so that models can be built from large amounts of evolving data.

The data associated with insider threat detection and classification is not only continuous; the patterns of both average users and insider threats gradually evolve. A novice programmer can develop into an expert programmer over time, and an insider threat can change his actions to more closely mimic legitimate user behavior. In either case, the patterns at the two ends of such a development can look drastically different when compared directly to each other. Our approach does not treat these natural changes as anomalies; instead, we classify them as natural concept drift. Traditional static supervised and unsupervised methods cannot handle such drift when it arises and therefore raise unnecessary false alarms, resulting in high false positive rates (FPRs). Learning models must be adept at coping with evolving concepts and highly efficient at building models from large amounts of data in order to rapidly detect real threats. For these reasons, the insider threat problem can be conceptualized as a stream mining problem over continuous data streams. Whether a supervised or an unsupervised learning algorithm is used, the method chosen must be highly adaptive to deal correctly with concept drift under these conditions.

Incremental learning and ensemble-based learning ([MASU10a], [MASU10b], [MASU11a], [MASU11b], [MASU08], [MASU13], [MASU11c], [ALKH12a], [MASU11d], [ALKH12b]) are two adaptive approaches that overcome this hindrance. An ensemble of K models that collectively vote on the final classification can reduce both the false negatives and the false positives on a test set. As new models are created and existing ones are updated to be more precise, the least accurate models are discarded so that the ensemble always contains exactly K current models.

An alternative to supervised learning is unsupervised learning, which can be applied effectively to purely unlabeled data, that is, data in which no points are explicitly identified as anomalous or nonanomalous. Graph-based anomaly detection (GBAD) is one important form of unsupervised learning ([COOK07], [EBER07], [COOK00]), but it has traditionally been limited to static, finite-length datasets. This limits its applicability to insider threat-related streams, which tend to have unbounded length and threat patterns that evolve over time. Applying GBAD to the insider threat problem therefore requires models that are adaptive and efficient, so that effective models can be built from vast amounts of evolving data.
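As a concrete illustration of the one-class idea, the following minimal sketch fits a model on nonthreat records only and then flags geometric outliers in later data. The use of scikit-learn's OneClassSVM and the synthetic feature vectors are assumptions for illustration; the work described here builds on the OCSVM formulation of [MANE02] and the LIBSVM library [CHAN11].

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
# Hypothetical feature vectors extracted from nonthreat (normal) activity only
normal_train = rng.normal(loc=0.0, scale=1.0, size=(500, 10))

# Fit the one-class model on normal data; nu bounds the fraction of
# training points allowed to fall outside the learned region (a tunable assumption)
ocsvm = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(normal_train)

# Later (test) records: mostly normal, plus a few shifted "threat-like" points
test = np.vstack([rng.normal(size=(95, 10)), rng.normal(loc=4.0, size=(5, 10))])

# predict() returns +1 for points consistent with the model, -1 for anomalies
labels = ocsvm.predict(test)
print("flagged as anomalous:", int((labels == -1).sum()))
```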

In this book, we cast insider threat detection as a stream mining problem and propose two methods (supervised and unsupervised learning) for efficiently detecting anomalies in stream data [PARV13]. To cope with concept evolution, our supervised approach maintains an evolving ensemble of multiple OCSVM models [PARV11b]. Our unsupervised approach combines multiple GBAD models in an ensemble of classifiers [PARV11a]. In both cases, the ensemble updating process is designed to keep the ensemble current as the stream evolves. This evolutionary capability improves the classifier’s resilience to concept drift as the behavior of both legitimate and illegitimate agents varies over time. In our experiments, we use test data consisting of system call records from a large, Unix-based, multiuser system.
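The ensemble maintenance just described can be sketched roughly as follows. This is a simplified illustration under assumptions of our own (a fixed ensemble size K, simple majority voting, and accuracy on the newest labeled chunk as the pruning criterion); the actual updating procedures are detailed in [PARV11a] and [PARV11b]. In a streaming setting, update_ensemble would be called once per incoming chunk and classify would label the chunk that arrives next.

```python
import numpy as np
from sklearn.svm import OneClassSVM

K = 5  # ensemble size (an assumed value)

def update_ensemble(ensemble, chunk_X, chunk_y):
    """Train a new OCSVM on the normal portion of the newest chunk (label +1),
    then keep the K models that score best on that chunk."""
    new_model = OneClassSVM(gamma="scale", nu=0.05).fit(chunk_X[chunk_y == +1])
    candidates = ensemble + [new_model]
    # Accuracy of each candidate on the newest labeled chunk (+1 normal, -1 anomaly)
    scores = [np.mean(m.predict(chunk_X) == chunk_y) for m in candidates]
    ranked = sorted(zip(scores, range(len(candidates))), reverse=True)
    return [candidates[i] for _, i in ranked[:K]]

def classify(ensemble, X):
    """Majority vote of the ensemble: +1 = normal, -1 = anomalous."""
    votes = np.sum([m.predict(X) for m in ensemble], axis=0)
    return np.where(votes >= 0, +1, -1)
```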

This chapter describes our approach to insider threat detection using stream data mining. In Section 14.2, we discuss sequence stream data. Big data issues are discussed in Section 14.3. Our contributions are discussed in Section 14.4. The chapter is summarized in Section 14.5.

14.2 Sequence Stream Data

The above approach may not work well for sequence data ([PARV12a], [PARV12b]). For sequence data, our approach maintains an ensemble of multiple unsupervised stream-based sequence learning (USSL) models [PARV12a]. During the learning process, we store the repetitive sequence patterns from a user’s actions or commands in a model called a quantized dictionary. In particular, longer patterns that acquire higher weights through frequent appearance in the stream are retained in the dictionary. An ensemble in this case is a collection of K quantized dictionary models. When new data arrives or is gathered, we generate a new quantized dictionary model from this new dataset and take a majority vote of all models to find the anomalous pattern sequences within it. We update the ensemble if the new dictionary outperforms others in the ensemble, discarding the least accurate model. In this way, the ensemble always keeps its models current as the stream evolves, preserving high detection accuracy as both legitimate and illegitimate behaviors evolve over time. Our test data consists of user command sequences recorded in real time for multiple users of varying experience levels, together with a concept-drift framework that further exhibits the practicality of this approach.
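To make the dictionary idea concrete, the sketch below builds a toy quantized dictionary from a stream of user commands. The pattern extraction shown here (fixed-length subsequences weighted by length times frequency, with only the top-weighted patterns kept) is a simplification of our own for illustration; the actual construction is described in [PARV12a] and [PARV12b]. A sequence in a newly arrived chunk whose subsequences are poorly covered by the dictionaries in the ensemble would then be a candidate anomaly under majority voting.

```python
from collections import Counter

def build_quantized_dictionary(commands, max_len=4, top_k=10):
    """Collect repeated command subsequences and weight them by
    length * frequency, keeping only the top_k heaviest patterns."""
    counts = Counter()
    for n in range(2, max_len + 1):
        for i in range(len(commands) - n + 1):
            counts[tuple(commands[i:i + n])] += 1
    weighted = {p: len(p) * c for p, c in counts.items() if c > 1}
    return dict(sorted(weighted.items(), key=lambda kv: kv[1], reverse=True)[:top_k])

# Toy command stream for one user session
stream = ["ls", "cd", "vi", "make", "ls", "cd", "vi", "make", "scp", "rm"]
for pattern, weight in build_quantized_dictionary(stream).items():
    print(pattern, weight)
```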

14.3 Big Data Issues

Quantized dictionary construction is time-consuming, and scalability is therefore a bottleneck. We exploit distributed computing to address this issue. There are two ways to achieve this goal: the first is parallel computing with a shared-memory architecture, which relies on expensive hardware; the second is distributed computing with a shared-nothing architecture, which exploits commodity hardware. We take the latter approach. Here, we use a MapReduce-based framework to facilitate quantization using the Hadoop Distributed File System (HDFS). We propose a number of algorithms to construct the quantized dictionary; for each of them we discuss the pros and cons and report performance results on a large dataset.
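As a rough illustration of how the pattern-counting step can be parallelized, the following Hadoop Streaming style mapper and reducer count candidate command subsequences across a command log stored in HDFS; the counts can then be turned into dictionary weights in a later pass. This is a simplified sketch of our own, not one of the algorithms proposed in this book, and the script names and the maximum pattern length are illustrative assumptions. The pair would be run with the standard Hadoop Streaming jar, passing the scripts via -mapper and -reducer over input files in HDFS.

```python
#!/usr/bin/env python
# mapper.py: emit (subsequence, 1) pairs for each line of user commands on stdin
import sys

MAX_LEN = 4  # assumed maximum pattern length

for line in sys.stdin:
    commands = line.strip().split()
    for n in range(2, MAX_LEN + 1):
        for i in range(len(commands) - n + 1):
            print("%s\t1" % " ".join(commands[i:i + n]))
```

```python
#!/usr/bin/env python
# reducer.py: sum counts per subsequence (input arrives sorted by key)
import sys

current_key, total = None, 0
for line in sys.stdin:
    key, value = line.rsplit("\t", 1)
    if key != current_key:
        if current_key is not None:
            print("%s\t%d" % (current_key, total))
        current_key, total = key, 0
    total += int(value)
if current_key is not None:
    print("%s\t%d" % (current_key, total))
```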

It should be noted that there are several directions for further work on applying big data technologies. For example, in addition to the Hadoop/MapReduce framework, we also need to examine the use of Spark and Storm technologies. Also, the big data management and analytics systems discussed in Chapter 7 have to be examined for developing scalable stream data analytics techniques for insider threat detection.

14.4 Contributions

The main contributions of this work can be summarized as follows (see Figure 14.1).

Figure 14.1 Contributions in visual form.

1.We show how stream mining can be effectively applied to detect insider threats.

2.With regard to nonsequence data:

a.We propose a supervised learning solution that copes with evolving concepts using one-class SVMs.

b.We increase the accuracy of the supervised approach by weighting the cost of false negatives (see the sketch following this list).

c.We propose an unsupervised learning algorithm that copes with changes based on GBAD.

d.We effectively address the challenge of limited labeled training data (rare instance issues).

e.We exploit the power of stream mining and graph-based mining by effectively combining the two in a unified manner. This is the first work to our knowledge to harness these two approaches for insider threat detection.

f.We compare one-class and two-class SVMs on how well they handle stream-based insider threat problems.

g.We compare supervised and unsupervised stream learning approaches and show which has superior effectiveness using real-world data.

3.With regard to sequence data:

a.For sequence data, we propose a framework that exploits unsupervised stream-based sequence learning (USSL) to find pattern sequences from successive user actions or commands.

b.We effectively integrate multiple USSL models in an ensemble of classifiers to exploit the power of ensemble-based stream mining and sequence mining.

c.We compare our approach with the supervised model for stream mining and show the effectiveness of our approach in terms of true positive rate (TPR) and FPR on a benchmark dataset.

4.With regard to big data:

a.Scalability is an issue when constructing benign pattern sequences for the quantized dictionary. To address this, we exploit a MapReduce-based framework and show the effectiveness of our approach.
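The cost weighting mentioned in contribution 2b can be sketched as follows. This is a minimal illustration under our own assumptions, using a standard two-class SVM with per-class misclassification costs (scikit-learn's class_weight parameter and synthetic data), so that missing a threat (a false negative) is penalized far more heavily than raising a false alarm.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(1)
# Hypothetical imbalanced training data: label 1 = threat (rare), 0 = nonthreat
X = np.vstack([rng.normal(size=(970, 10)), rng.normal(loc=2.0, size=(30, 10))])
y = np.array([0] * 970 + [1] * 30)

# Penalize errors on the threat class ~30x more than on the nonthreat class,
# discouraging the optimizer from producing false negatives
clf = SVC(kernel="rbf", gamma="scale", class_weight={0: 1.0, 1: 30.0}).fit(X, y)

print("predicted threats in training data:", int(clf.predict(X).sum()))
```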

14.5 Summary and Directions

Our approach is to cast insider threat detection as a stream mining problem and to propose two methods (supervised and unsupervised learning) for efficiently detecting anomalies in stream data. To cope with concept evolution, our supervised approach maintains an evolving ensemble of multiple OCSVM models, while our unsupervised approach combines multiple GBAD models in an ensemble of classifiers. In both cases, the ensemble updating process is designed to keep the ensemble current as the stream evolves. This evolutionary capability improves the classifier’s resilience to concept drift as the behavior of both legitimate and illegitimate agents varies over time. In the experiments, we use test data consisting of system call records from a large, Unix-based, multiuser system.

This chapter has provided an overview of our approach to insider threat detection using stream analytics and discussed the big data issue with respect to the problem. That is, massive amounts of stream data are emanating from various devices and we need to analyze this data for insider threat detection. We essentially adapt the techniques discussed in Section II for insider threat detection. These techniques are discussed in the ensuing chapters of Section III.

References

[ALKH12a]. T. Al-Khateeb, M.M. Masud, L. Khan, C.C. Aggarwal, J. Han, B.M. Thuraisingham, “Stream Classification with Recurring and Novel Class Detection Using Class-Based Ensemble,” ICDM, Brussels, Belgium, pp. 31–40, 2012.

[ALKH12b]. T. Al-Khateeb, M.M. Masud, L. Khan, B.M. Thuraisingham, “Cloud Guided-Stream Classification Using Class-Based Ensemble,” IEEE CLOUD, Honolulu, Hawaii, pp. 694–701, 2012.

[BRAC04]. R.C. Brackney and R.H. Anderson (editors). Understanding the Insider Threat. RAND Corporation, Santa Monica, CA, 2004.

[CHAN11]. C.-C. Chang and C.-J. Lin, “LIBSVM: A Library for Support Vector Machines,” ACM Transactions on Intelligent Systems and Technology, 2(3), 2011, Article #27.

[COOK00]. D.J. Cook and L.B. Holder, “Graph-Based Data Mining,” IEEE Intelligent Systems, 15(2), 32–41, 2000.

[COOK07]. D.J. Cook and L.B. Holder, (Eds.). Mining Graph Data. John Wiley & Sons, Inc., Hoboken, NJ, 2007.

[EBER07]. W. Eberle and L.B. Holder, “Mining for Structural Anomalies in Graph-Based Data,” In Proceedings of International Conference on Data Mining (DMIN), Las Vegas, NV, pp. 376–389, 2007.

[HAMP99]. M.P. Hampton and M. Levi, “Fast Spinning into Oblivion? Recent Developments in Money-Laundering Policies and Offshore Finance Centres,” Third World Quarterly, 20(3), 645–656, 1999.

[MANE02]. L.M. Manevitz and M. Yousef, “One-Class SVMs for Document Classification,” The Journal of Machine Learning Research, 2, 139–154, 2002.

[MASU08]. M.M. Masud, J. Gao, L. Khan, J. Han, B. Thuraisingham, “A Practical Approach to Classify Evolving Data Streams: Training with Limited Amount of Labeled Data,” In Proceedings of IEEE International Conference on Data Mining (ICDM), Pisa, Italy, pp. 929–934, 2008.

[MASU10a]. M.M. Masud, Q. Chen, J. Gao, L. Khan, C. Aggarwal, J. Han, B. Thuraisingham, “Addressing Concept-Evolution in Concept-Drifting Data Streams,” In Proceedings of IEEE International Conference on Data Mining (ICDM), pp. 929–934, 2010.

[MASU10b]. M.M. Masud, Q. Chen, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class Detection of Data Streams in A Dynamic Feature Space,” ECML/PKDD (2), pp. 337–352, 2010.

[MASU11a]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class Detection in Concept-Drifting Data Streams Under Time Constraints,” IEEE Transactions on Knowledge and Data Engineering, 23(6), 859–874, 2011.

[MASU11b]. M.M. Masud, J. Gao, L. Khan, J. Han, B.M. Thuraisingham, “Classification and Novel Class Detection in Concept-Drifting Data Streams under Time Constraints,” IEEE Transactions on Knowledge and Data Engineering, 23(6), 859–874, 2011.

[MASU11c]. M.M. Masud, C. Woolam, J. Gao, L. Khan, J. Han, K.W. Hamlen, N.C. Oza, “Facing The Reality of Data Stream Classification: Coping with Scarcity of Labeled Data,” Knowledge and Information Systems, 33(1), 213–244, 2011.

[MASU11d]. M.M. Masud, T. Al-Khateeb, L. Khan, C.C. Aggarwal, J. Gao, J. Han, B.M. Thuraisingham, “Detecting Recurring and Novel Classes in Concept-Drifting Data Streams,” ICDM, pp. 1176–1181, 2011.

[MASU13]. M.M. Masud, Q. Chen, L. Khan, C.C. Aggarwal, J. Gao, J. Han, A.N. Srivastava, N.C. Oza, “Classification and Adaptive Novel Class Detection of Feature-Evolving Data Streams,” IEEE Transactions on Knowledge and Data Engineering, 25(7), 1484–1497, 2013.

[MATZ04]. S. Matzner and T. Hetherington, “Detecting Early Indications of A Malicious Insider,” IA Newsletter, 7(2), 42–45, 2004.

[PARV11a]. P. Parveen, J. Evans, B. Thuraisingham, K.W. Hamlen, L. Khan, “Insider Threat Detection Using Stream Mining and Graph Mining,” In Proceedings of the 3rd IEEE Conference on Privacy, Security, Risk and Trust (PASSAT), MIT, Boston, MA, October, pp. 1102–1110, 2011 (acceptance rate 8%; nominated for Best Paper Award).

[PARV11b]. P. Parveen, Z.R. Weger, B. Thuraisingham, K.W. Hamlen, L. Khan, “Supervised Learning for Insider Threat Detection Using Stream Mining,” In Proceedings of the 23rd IEEE International Conference on Tools with Artificial Intelligence (ICTAI), Boca Raton, FL, November 7–9, pp. 1032–1039, 2011 (acceptance rate 30%; Best Paper Award).

[PARV12a]. P. Parveen, N. McDaniel, B. Thuraisingham, L. Khan, “Unsupervised Ensemble Based Learning for Insider Threat Detection,” In Proceedings of 4th IEEE International Conference on Information Privacy, Security, Risk and Trust (PASSAT), September, Amsterdam, the Netherlands, pp. 718–727, 2012.

[PARV12b]. P. Parveen and B. Thuraisingham, “Unsupervised Incremental Sequence Learning for Insider Threat Detection,” In Proceedings of IEEE International Conference on Intelligence and Security Informatics (ISI), June, Washington DC, pp. 141–143, 2012.

[PARV13]. P. Parveen, N. McDaniel, J. Evans, B. Thuraisingham, K.W. Hamlen, L. Khan, “Evolving Insider Threat Detection Stream Mining Perspective,” International Journal on Artificial Intelligence Tools (World Scientific Publishing), 22(5), 1360013-1–1360013-24, 2013.

[SALE11]. M.B. Salem and S.J. Stolfo, “Modeling User Search Behavior for Masquerade Detection,” In Proceedings of Recent Advances in Intrusion Detection (RAID), pp. 181–200, 2011.
