8
A Process Framework for the Classification of Security Bug Reports

SHAHID HUSSAIN

Department of Computer and Information Science, University of Oregon, Eugene, Oregon, USA

Email: [email protected]

Abstract

Numerous organizations maintain records of bug reports submitted by different types of sources. For example, in the context of software development, bugs are reported by developers, designers, testers and end users. Various studies have introduced models for the identification of security-related bugs; however, security bug reports are frequently misclassified because of their small ratio compared to non-security bug reports and because of the presence of security-related keywords in non-security bug reports, which can increase the time and effort of bug engineers. To mitigate this issue, we propose a methodology that identifies the important security-related keywords in security bug reports (SBRs) and removes the non-security bug reports (NSBRs) containing these keywords in order to improve classification decisions. First, the proposed method is evaluated against state-of-the-art feature selection methods in terms of increasing classifier performance. Second, the classifier's performance is evaluated in terms of decreasing its false positive rate (FPR). The promising results indicate the significance of the proposed methodology for the effective identification of security bug reports.

Keywords: Bug reports, odds ratio, classification, performance

8.1 Introduction

In order to maintain software products, there is a strong need to continually assess the system and integrate changes based on user needs and demands. A bug tracking system (BTS) supports this by allowing users to report bugs while using software products, which in turn allows developers to make the system less vulnerable and more error-free. Bug fixing is one of the most important parts of software maintenance for client satisfaction. Various types of bug reports are submitted to the BTS, but the most critical ones are those related to security. A security bug report (SBR) describes a security loophole in the system that can be easily exploited; therefore, it is important to find and repair SBRs quickly [1]. Security bug detection is a key concern in the current era, as we have seen a number of security breach incidents, such as the Equifax data breach, which compromised the privacy of millions of Americans [2], and the Careem ride-hailing application data breach, which affected 14 million people [3].

Over the past few years, detecting security bugs has attracted the attention of the research community, which is working on helping bug engineers immediately identify and resolve security-related bugs. In this respect, different text-based predictive models have been developed [4-6]. These models are intended to effectively identify and classify security bug reports. However, they face the issue of misclassification of SBRs as NSBRs. There are two major reasons why SBRs are mislabeled. The first is class imbalance, because SBRs are far less numerous than NSBRs in the corpus. The second is the lack of familiarity with the security domain, i.e., the presence of security crosswords. Security crosswords are security-related keywords that appear in both SBRs and NSBRs. Previous studies have attempted to address the misclassification issue by extracting relevant keywords and their frequencies from the corpus terms [7, 8] and using these word frequencies as features to train machine learning (ML) algorithms.

However, these studies gave rise to an increased rate of false positives, and none of them explores security crosswords (the presence of security-related keywords in both SBRs and NSBRs) in depth. In [9], the authors introduced a bug report prediction model by first identifying the security-related keywords and then scoring them using different support functions. However, their research focused on keyword occurrences rather than their relevance to each class. We replicated the FARSEC study [9] and propose to improve it by using a different and improved bi-normal separation (BNS) scoring method for security-related keywords. The scoring process for keywords in FARSEC has certain limitations:

  • It adds bias towards the SBR class by using support functions to reduce the false positive rate. However, this may ignore the relevance of words to their labels and affect the discriminating ability to select highly relevant features.
  • Scoring is based on the frequency of occurrence of words, which is not a reliable indicator of a keyword's importance for the target class, since FARSEC does not consider the context in which security crosswords appear in SBRs and NSBRs.
  • It leads to feature redundancy, because features were extracted from a subset of the data.

We overcome this problem by proposing a method for scoring keywords using the feature selection technique known as BNS. The contributions of this chapter are:

  • Automatic retrieval of security-relevant keywords.
  • Scoring these security-related keywords using a more effective and improved BNS feature selection method.
  • BNS removes feature redundancy and yields unique keywords that are highly impactful for efficient classification.
  • The BNS technique helps to combat class imbalance, because the model is trained only on features that are strongly related to the positive class, producing correct classification results.
  • Scoring of each bug report according to its keyword scores.
  • Removal of NSBRs containing security-related keywords in order to remove false positive cases.
  • Prediction models are constructed using the most appropriate features.
  • The results showed that extracting important security keywords significantly improved the results, both in terms of classification and in terms of reducing the false positive rate.

We used the publicly available bug report data from five projects, i.e., one from the Chromium project and four from Apache projects (Ambari, Derby, Camel and Wicket). The total number of bug reports amounts to approximately 45,940. The dataset is highly imbalanced, as only 0.8% of the bug reports belong to the security class. The formulated research questions are as follows:

  • Research Question 1 (RQ1): How effective is the proposed methodology in constructing an effective model, given that the SBR data is very small compared to the NSBR data?
  • Research Question 2 (RQ2): How effective is the proposed methodology in addressing the issue of security crosswords that are present in NSBRs and contribute to the misclassification of SBRs?

The remainder of this chapter is organized as follows: related work is presented in Section 8.2, the proposed methodology in Section 8.3, and the experimental setup in Section 8.4. Section 8.5 presents the results and discussion, and Section 8.6 concludes the chapter.

8.2 Related Work

8.2.1 Text Mining for Security Bug Report Prediction

A bug tracking system accumulates a large number of bug reports, and one important task is to identify security bug reports and separate them from non-security ones. Goseva-Popstojanova and Tyo [7] proposed both supervised and unsupervised automated approaches for classifying security bug reports. Both approaches employ three kinds of feature vectors. Their work analyzes the effect of different classifiers and of varying the size of the training data in the supervised setting, and investigates the unsupervised setting in the context of anomaly detection. The evaluation was carried out on three NASA datasets. Although the approach performs well, it does not take into account security words that are absent from the vocabulary and also requires labeling of the test data.

In [10], the authors introduced the n-gram IDF technique, which extracts keywords of any length so that they can be used as features. Its performance was better than that of topic modeling approaches, but only a small number of domain-specific datasets were tested and the performance changed as the number of grams changed. In [12], the authors used several types of features, i.e., meta features and textual features, to automatically identify security bug reports; the extracted multi-type features are then used to predict bug reports. In [9], the authors introduced a bug report prediction model by first identifying relevant security-related keywords using a scoring method and then removing the NSBRs containing these security crosswords in order to decrease the false positive rate. However, their study was based on the mere presence of security-related keywords and did not consider an optimal way to select a keyword as a security crossword. To overcome this problem, we propose a framework that automatically extracts security-related keywords and scores them using bi-normal separation (BNS).

8.2.2 Machine Learning Algorithms-Based Prediction

Our research uses a Naïve Bayes classifier, an algorithm based on Bayes' theorem. It predicts the class of a data instance according to the probability that the instance belongs to that particular class.

8.2.3 Bi-Normal Separation for Feature Selection

Feature selection is one of the most important factors for boosting classifier performance, since the most relevant and robust features yield the best classification results. There are different types of feature selection techniques, including filter, wrapper and embedded techniques. Previous studies applied various feature selection techniques to textual data and showed that BNS has the second best performance on short-document data [13, 14, 25].

In [15], the authors indicated that BNS is the most feasible feature selection technique when the data is highly skewed. A recent study has also shown that BNS outperforms TF-IDF in textual analysis [16]. Keeping this research in mind, and since our bug report documents are short and the dataset is heavily imbalanced, we took advantage of the BNS feature selection technique to score security keywords. The BNS score helps us understand the underlying context and the relevance of security-related keywords to each class. As a result of this stage, we obtain highly important and influential security-related keywords, and these keywords are treated as features to train the classifiers.
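For reference, BNS as commonly defined in the feature selection literature scores a term t as

BNS(t) = |F^{-1}(tpr) - F^{-1}(fpr)|

where F^{-1} is the inverse cumulative distribution function of the standard normal distribution, tpr is the fraction of positive-class documents (here, SBRs) containing t, and fpr is the fraction of negative-class documents (NSBRs) containing t; both rates are clipped away from 0 and 1 so that the inverse CDF remains finite.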

8.3 Proposed Methodology

Figure 8.1 illustrates the proposed system design. It is an extension of [9] and the steps involved are given in subsection 8.3.1.


Figure 8.1: Proposed system model.

8.3.1 Data Gathering and Preprocessing

  • Data Gathering: We collected bug report data from the publicly available Chromium project as well as four Apache projects (Ambari, Wicket, Camel and Derby), for a total of 45,940 bug reports. These datasets are highly imbalanced: security bug reports are far less numerous than non-security bug reports.
  • Text Preprocessing: Preprocessing is key to text mining. It not only reduces the size of the document file but also cleans the text by deleting useless data such as punctuation, links and numbers. In addition, it prepares the data for training the classifiers used in the proposed study; text preprocessing considerably improves classification results. We used the Scikit-learn library [22] to carry out the preprocessing. The steps involved are listed below, followed by a minimal code sketch:
  • Tokenization: This is the method by which the phrase or text is divided into small pieces called tokens. These tokens can consist of words, characters and even phrases, which become inputs to the text mining algorithm. Tokenization helps to explore the document in the form needed for mining.
  • Text Cleaning: This consists of filtering the text by deleting digits, punctuation, stop words and other unnecessary data. Stop words are common words that carry no importance and only serve to form the sentence structure, such as a, and, the, in, about, etc. They have to be removed from the text.
  • Text Lemmatization: This is the process by which morphologically related words are reduced to a single base word. It groups together the inflected forms of a common root word, noun or verb, and treats them as a single word.
  • Text Stemming: This is one of the most widely used approaches to text preprocessing. The idea of stemming is to cut inflected words back to the specific root word from which they are derived.
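The following is a minimal, illustrative sketch of these preprocessing steps in Python. The study reports using Scikit-learn [22] for preprocessing; the use of NLTK for tokenization, lemmatization and stemming below is our own assumption, since the exact code is not given.

import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize

STOP_WORDS = set(stopwords.words("english"))   # requires nltk.download("stopwords")
lemmatizer = WordNetLemmatizer()               # requires nltk.download("wordnet")
stemmer = PorterStemmer()

def preprocess(report_text):
    """Tokenize, clean, lemmatize and stem a single bug report."""
    text = report_text.lower()
    text = re.sub(r"http\S+", " ", text)       # remove links
    text = re.sub(r"[^a-z\s]", " ", text)      # remove digits and punctuation
    tokens = word_tokenize(text)               # tokenization (requires nltk.download("punkt"))
    tokens = [t for t in tokens if t not in STOP_WORDS and len(t) > 2]  # text cleaning
    tokens = [lemmatizer.lemmatize(t) for t in tokens]                  # lemmatization
    return [stemmer.stem(t) for t in tokens]                            # stemming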

8.3.2 Identifying Security-Related Keywords

In this subsection, we explain how security-related keywords are identified from the corpus.

  • Tokenization: This is the process of converting text into small chunks (words) called tokens.
  • Weighting Method: We employed the term frequency-inverse document frequency (TF-IDF) weighting method, which weights words according to their association with a document. We used TF-IDF to obtain a list of security-relevant keywords; a minimal sketch of this step is given after the list.
  • Indexing: We used the document-term matrix for indexing, where rows refer to bug reports while columns are the security-related keywords obtained in the previous step.
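A minimal sketch of the weighting and indexing steps with Scikit-learn's TfidfVectorizer is shown below. Ranking terms by their aggregate TF-IDF weight and keeping the top_k of them is an illustrative assumption rather than the exact procedure of the study.

from sklearn.feature_extraction.text import TfidfVectorizer

def extract_security_keywords(sbr_texts, top_k=100):
    """Build a document-term matrix over the SBRs and rank terms by TF-IDF weight."""
    vectorizer = TfidfVectorizer(stop_words="english")
    dtm = vectorizer.fit_transform(sbr_texts)    # rows: bug reports, columns: terms
    weights = dtm.sum(axis=0).A1                 # aggregate TF-IDF weight per term
    terms = vectorizer.get_feature_names_out()
    ranked = sorted(zip(terms, weights), key=lambda kv: kv[1], reverse=True)
    return [term for term, _ in ranked[:top_k]]  # candidate security-related keywords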

8.3.3 Scoring Keywords

The security keywords obtained in the previous step are now scored according to the proposed Algorithm 8.1 using the BNS scale. This step makes it possible to identify the underlying context of the security crosswords. Each security keyword is scored against the SBR and NSBR classes. If a keyword scores high for the SBR class, it is highly influential and important for that class, and when it also appears in NSBRs it acts as a security crossword there. NSBRs containing security crosswords will be eliminated to decrease the rate of false positives. A scoring sketch is given below.
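The sketch below is one possible reading of this scoring step rather than the exact Algorithm 8.1: each keyword's BNS score is computed from how often it appears in SBRs versus NSBRs.

from scipy.stats import norm

def bns_score(term, sbr_token_sets, nsbr_token_sets, eps=0.0005):
    """BNS(t) = |F^-1(tpr) - F^-1(fpr)|, with rates clipped to avoid infinite values."""
    tpr = sum(term in doc for doc in sbr_token_sets) / len(sbr_token_sets)
    fpr = sum(term in doc for doc in nsbr_token_sets) / len(nsbr_token_sets)
    tpr = min(max(tpr, eps), 1 - eps)
    fpr = min(max(fpr, eps), 1 - eps)
    return abs(norm.ppf(tpr) - norm.ppf(fpr))    # norm.ppf is the inverse normal CDF

def score_keywords(keywords, sbr_token_sets, nsbr_token_sets):
    """Map each candidate security keyword to its BNS score."""
    return {k: bns_score(k, sbr_token_sets, nsbr_token_sets) for k in keywords}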

8.3.4 Scoring Bug Reports

Algorithm 8.2 is designed to score bug reports. Each bug report is scored on the basis of the security-related keywords it contains: if a security keyword is present in the bug report, the corresponding keyword score is added to the report's total; otherwise, zero is added. If the total score of an NSBR exceeds the threshold of 0.75, the NSBR is considered a likely source of false positives and is pruned; NSBRs scoring below the threshold are retained.
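A minimal sketch of this step follows. The 0.75 threshold comes from the text; how the per-report score is normalized is an assumption, and we follow the interpretation, consistent with Sections 8.3.3 and 8.6, that NSBRs scoring above the threshold are pruned.

def score_report(token_set, keyword_scores):
    """Sum the BNS scores of the security keywords present in one bug report."""
    return sum(keyword_scores.get(tok, 0.0) for tok in token_set)

def prune_nsbrs(nsbr_token_sets, keyword_scores, threshold=0.75):
    """Drop NSBRs whose keyword score exceeds the threshold (likely false-positive sources)."""
    return [doc for doc in nsbr_token_sets
            if score_report(doc, keyword_scores) <= threshold]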

8.4 Experimental Setup

The experimental parameters, the applied algorithm and the performance assessment measures are addressed in this section. We performed the experiments using Python's Scikit-learn library.

8.4.1 Machine Learning Algorithm

The machine learning algorithm used in this approach is Naïve Bayes, which is considered an effective and efficient machine learning algorithm. In several studies, Lessmann et al. [23], Menzies et al. [24] and Hussain et al. [26] concluded that, for software defect prediction, Naïve Bayes works extremely well compared with other machine learning algorithms [27-30].
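The sketch below shows how such a classifier can be trained on the selected keywords with Scikit-learn. Restricting the vectorizer's vocabulary to the scored security keywords is our illustrative reading of the feature construction, not the exact code used in the study.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def train_classifier(train_texts, train_labels, security_keywords):
    """Train Naive Bayes using only the scored security keywords as features."""
    model = make_pipeline(
        CountVectorizer(vocabulary=security_keywords),  # features restricted to selected keywords
        MultinomialNB(),
    )
    model.fit(train_texts, train_labels)                # labels assumed: 1 = SBR, 0 = NSBR
    return model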

8.4.2 Dataset

We used five labeled datasets of bug reports, comprising a total of about 45,940 bug reports. The dataset characteristics are presented in Table 8.1.

8.4.3 Performance Evaluation

To evaluate the proposed approach, we used precision, recall and F-measure. We also compared the change in recall against FARSEC, with better results. A minimal evaluation sketch is given below.
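The following sketch computes the evaluation measures with Scikit-learn's metrics; the assumption that SBRs are labeled 1 is ours.

from sklearn.metrics import precision_score, recall_score, f1_score

def evaluate(model, test_texts, test_labels):
    """Report precision, recall and F-measure for the SBR (positive) class."""
    preds = model.predict(test_texts)
    return {
        "precision": precision_score(test_labels, preds, pos_label=1),
        "recall": recall_score(test_labels, preds, pos_label=1),
        "f_measure": f1_score(test_labels, preds, pos_label=1),
    }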

8.5 Results and Discussion

We performed experiments on the public datasets using the Naïve Bayes classifier.

8.5.1 Response to RQ1

It is possible to build an efficient model in spite of a class imbalance problem if one works on an effective selection of features. We achieved this with the help of Algorithm 8.1, since we scored the security-related keywords and used only the most relevant words as features. We used Naïve Bayes as our base classifier and compared our feature selection technique with FARSEC.

The results showed that integrating our methodology with this classifier was more effective. Figure 8.2 provides a more detailed illustration of the findings.


Figure 8.2: Performance evaluation of datasets using precision, recall and F-measure.

Figure 8.2 presents the precision, recall and F-measure values of NB for each dataset considered in the proposed study. We achieved higher recall for every dataset. The high recall value of NB on each dataset indicates the effectiveness of the model in predicting the most relevant cases. Therefore, we use recall as the performance measure when responding to RQ2.

8.5.2 Response to RQ2

The issue of the presence of security crosswords in NSBRs is addressed by Algorithm 8.2, which scores the bug reports. This algorithm calculates the score of each bug report and deletes NSBRs according to the defined threshold. Figure 8.3 compares the change in recall between the FARSEC methodology and our proposed methodology. The graph shows that the recall of our methodology is higher than that of FARSEC, implying that the algorithm predicts the relevant results more precisely.


Figure 8.3: Change in recall.

8.6 Conclusion

This research study provided a framework for mitigating labeling errors in security bug reports. One factor contributing to the misclassification is class imbalance, because security bug reports are far less numerous than non-security bug reports; the second factor is the presence of security crosswords. We proposed a methodology to mitigate the crossword issue by scoring each word in favor of the security bug report class and using highly relevant words as classification features. To address the class imbalance issue, we scored each bug report based on the keyword scores computed earlier and deleted the NSBRs with a score higher than the specified threshold.

In the future, we plan to work on further improving and resolving the problem of class imbalance. Apart from this, we can work on enabling learners to choose the cut-off threshold for NSBRs automatically.

References

1. Shu, R., Xia, T., Williams, L., & Menzies, T. (2019). Better security bug report classification via hyperparameter optimization. arXiv preprint arXiv:1905.06872.

2. Gressin, S. (2017). The Equifax data breach: What to do. Federal Trade Commission, 8. https://www.ftc.gov/equifax-data-breach.

3. https://blog.careem.com/en/security/

4. Chawla, I., & Singh, S. K. (2014, August). Automatic bug labeling using semantic information from LSI. In 2014 Seventh International Conference on Contemporary Computing (IC3) (pp. 376-381). IEEE.

5. Xia, X., Lo, D., Qiu, W., Wang, X., & Zhou, B. (2014, July). Automated configuration bug report prediction using text mining. In 2014 IEEE 38th Annual Computer Software and Applications Conference (pp. 107-116). IEEE.

6. Xia, X., Lo, D., Shihab, E., & Wang, X. (2015). Automated bug report field reassignment and refinement prediction. IEEE Transactions on Reliability, 65(3), 1094-1113.

7. Goseva-Popstojanova, K., & Tyo, J. (2018, July). Identification of security related bug reports via text mining using supervised and unsupervised classification. In 2018 IEEE International Conference on Software Quality, Reliability and Security (QRS) (pp. 344-355). IEEE.

8. Khan, A. A., Shameem, M., Nadeem, M., & Akbar, M. A. (2021). Agile trends in Chinese global software development industry: Fuzzy AHP based conceptual mapping. Applied Soft Computing, 102, 107090.

9. Peters, F., Tun, T. T., Yu, Y., & Nuseibeh, B. (2017). Text filtering and ranking for security bug report prediction. IEEE Transactions on Software Engineering, 45(6), 615-631.

10. Terdchanakul, P., Hata, H., Phannachitta, P., & Matsumoto, K. (2017, September). Bug or not? bug report classification using n-gram idf. In 2017 IEEE International Conference on Software Maintenance and Evolution (ICSME) (pp. 534-538). IEEE.

11. Le, D. N., Nguyen, G. N., Garg, H., Huynh, Q. T., Bao, T. N., & Tuan, N. N. (2021). Optimizing Bidders Selection of Multi-Round Procurement Problem in Software Project Management Using Parallel Max-Min Ant System Algorithm. CMC-COMPUTERS MATERIALS & CONTINUA, 66(1), 993-1010.

12. Zou, D., Deng, Z., Li, Z., & Jin, H. (2018, July). Automatically identifying security bug reports via multitype features analysis. In Australasian Conference on Information Security and Privacy (pp. 619-633). Springer, Cham.

13. Abbasi, B. Z., Hussain, S., Bibi, S., & Shah, M. A. (2018, September). Impact of Membership and Non-membership Features on Classification Decision: An Empirical Study for Appraisal of Feature Selection Methods. In 2018 24th International Conference on Automation and Computing (ICAC) (pp. 1-6). IEEE.

14. Asim, M. N., Wasim, M., Ali, M. S., & Rehman, A. (2017, November). Comparison of feature selection methods in text classification on highly skewed datasets. In 2017 First International Conference on Latest trends in Electrical Engineering and Computing Technologies (INTELLECT) (pp. 1-8). IEEE.

15. Tang, L., & Liu, H. (2005, November). Bias analysis in text classification for highly skewed data. In Fifth IEEE International Conference on Data Mining (ICDM’05) (pp. 4-pp). IEEE.

16. Baillargeon, J. T., Lamontagne, L., & Marceau, É. (2019, May). Weighting Words Using Bi-Normal Separation for Text Classification Tasks with Multiple Classes. In Canadian Conference on Artificial Intelligence (pp. 433-439). Springer, Cham.

17. Bishop, C. M. (1995). Neural networks for pattern recognition. Oxford University Press.

18. Rish, I. (2001, August). An empirical study of the naive Bayes classifier. In IJCAI 2001 workshop on empirical methods in artificial intelligence (Vol. 3, No. 22, pp. 41-46).

19. Keller, J. M., Gray, M. R., & Givens, J. A. (1985). A fuzzy k-nearest neighbor algorithm. IEEE Transactions on Systems, Man, and Cybernetics, (4), 580-585.

20. Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5-32.

21. Hosmer Jr, D. W., Lemeshow, S., & Sturdivant, R. X. (2013). Applied logistic regression (Vol. 398). John Wiley & Sons.

22. Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., ... & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12, 2825-2830.

23. Lessmann, S., Baesens, B., Mues, C., & Pietsch, S. (2008). Benchmarking classification models for software defect prediction: A proposed framework and novel findings. IEEE Transactions on Software Engineering, 34(4), 485-496.

24. Menzies, T., Greenwald, J., & Frank, A. (2006). Data mining static code attributes to learn defect predictors. IEEE Transactions on Software Engineering, 33(1), 2-13.

25. Hussain, S., Mufti, M. R., Sohail, M. K., Afzal, H., Ahmad, G., & Khan, A. A. (2019). A Step towards the Improvement in the Performance of Text Classification. KSII Transactions on Internet and Information Systems (TIIS), 13(4), 2162-2179.

26. Hussain, S., Keung, J., Khan, A. A., & Bennin, K. E. (2015, September). Performance evaluation of ensemble methods for software fault prediction: An experiment. In Proceedings of the ASWEC 2015 24th Australasian Software Engineering Conference (pp. 91-95).

27. Hussain, S., Keung, J., Sohail, M. K., Khan, A. A., Ahmad, G., Mufti, M. R., & Khatak, H. A. (2019). Methodology for the quantification of the effect of patterns and anti-patterns association on the software quality. IET Software, 13(5), 414-422.

28. Khan, A. A., Shameem, M., Kumar, R. R., Hussain, S., & Yan, X. (2019). Fuzzy AHP based prioritization and taxonomy of software process improvement success factors in global software development. Applied Soft Computing, 83, 105648.

29. Bao, T. N., Huynh, Q. T., Nguyen, X. T., Nguyen, G. N., & Le, D. N. (2020). A Novel Particle Swarm Optimization Approach to Support Decision-Making in the Multi-Round of an Auction by Game Theory. International Journal of Computational Intelligence Systems, 13(1), 1447-1463.

30. Le, D. N. (2017). A new ant algorithm for optimal service selection with end-to-end QoS constraints. Journal of Internet Technology, 18(5), 1017-1030.
