Chapter 6

Model analytics for defect prediction based on design-level metrics and sampling techniques

Aydin Kaya (a), Ali Seydi Keceli (a), Cagatay Catal (b), Bedir Tekinerdogan (b)
(a) Department of Computer Engineering, Hacettepe University, Ankara, Turkey
(b) Information Technology Group, Wageningen University, Wageningen, The Netherlands

Abstract

Predicting software defects in the early stages of the software development life cycle, such as the design and requirement analysis phases, provides significant economic advantages for software companies. Model analytics for defect prediction lets quality assurance groups build prediction models earlier and identify defect-prone components before the testing phase so that they can be tested in depth. In this study, we demonstrate that Machine Learning-based defect prediction models using design-level metrics in conjunction with data sampling techniques are effective in finding software defects. We show that design-level attributes have a strong correlation with the probability of defects and that the SMOTE data sampling approach improves the performance of prediction models. When design-level metrics are applied, the Adaboost ensemble method provides the best performance in detecting the minority class samples.

Keywords

defect prediction; design-level metrics; sampling techniques; software defects; model analytics

6.1 Introduction

Software defect prediction is a quality assurance activity which applies historical defect data in conjunction with software metrics [1–3]. It identifies which software modules are defect-prone before the testing phase; quality assurance groups can then allocate more testing resources to these modules, because the other group of modules, called nondefect-prone, is unlikely to cause defects according to the output of the prediction approach [4]. Most of the studies in the literature use classification algorithms in Machine Learning to classify software modules into two categories (defect-prone and nondefect-prone) [1].

Prior defects and software metrics data are mostly stored in different repositories, namely bug tracking systems and source code repositories, and the processing of these data depends on the underlying platforms. Mapping each defect to the appropriate source code is not a trivial task and sometimes introduces errors. While defect prediction provides several benefits to organizations, only a limited number of companies have adopted this approach as part of their daily development practices so far. One strategy to increase the adoption of these approaches is to develop new models that improve the performance of existing defect prediction models.

In this study, we focus on data sampling techniques for this improvement because defect prediction datasets are typically highly unbalanced: only approximately 10–20% of the modules belong to the defect-prone category. While there have been several attempts to apply sampling techniques in this domain, there is no in-depth study which analyzes several classification algorithms in conjunction with design-level metrics and sampling approaches. Hence, we performed several experiments on ten public datasets using six classification algorithms and design-level metrics.

The object-oriented Chidamber–Kemerer (CK) metrics suite, which includes six class-level metrics, is used in this study. Since the datasets include additional features such as lines of code, we performed a second case study to compare these models with prediction models built on the full set of features. We adopted the following classification algorithms because they have been widely used in defect prediction studies: AdaBoostM1, Linear Discriminant, Linear Support Vector Machine (SVM), Random Forest, Subspace Discriminant, and Weighted-kNN (W-kNN). The following performance evaluation parameters are used to evaluate the models: Area under the ROC Curve (AUC), Recall (Sensitivity), Precision, F-score, and Specificity [5].

We analyzed the effects of the following sampling methods on the performance of defect prediction models using design-level metrics: ADASYN, Borderline SMOTE, and SMOTE. We compared their performance against the ones built based on the unbalanced data.

Our research questions are given as follows:

  •  RQ1: Which sampling techniques are more effective to improve the performance of defect prediction models?
  •  RQ2: Which classifiers are more effective in predicting software defects when sampling techniques are applied?
  •  RQ3: Are design-level metrics (CK metrics suite) suitable to build defect prediction models when sampling techniques are applied?

The contribution of this chapter is two-fold:

  •  The effects of sampling techniques on the software defect prediction problem are investigated in detail.
  •  The effects of six classification algorithms and design-level software metrics are evaluated on ten public datasets.

This chapter is organized as follows: Section 6.2 explains the background and related work, Section 6.3 presents the methodology of our experiments, Section 6.4 explains the experimental results, Section 6.5 provides the discussion, and Section 6.6 presents the conclusion and future work.

6.2 Background and related work

Software defect prediction, a quality assurance activity, identifies defect-prone modules before the testing phase so that more testing resources can be allocated to these modules to detect defects before software deployment. It is a very active research field which has attracted many researchers in the software engineering community since the mid-2000s [1,4].

Machine Learning algorithms are widely used in these approaches, and the prediction models are built using software metrics and defect labels. Software metrics can be collected at several levels, i.e., the requirements level, design level, implementation level, and process level. In the first case study of this chapter, we applied the CK metrics suite, which is a set of design-level metrics. The CK metrics are explained as follows:

  •  The Weighted Methods per Class (WMC) metric shows the number of methods in a class.
  •  The Depth of Inheritance Tree (DIT) metric indicates the length of the longest path from a class to the root element in the inheritance hierarchy.
  •  The Response For a Class (RFC) metric gives the number of methods that can be executed in response to a message received by an object of the class.
  •  The Number Of Children (NOC) metric indicates the number of direct descendants of a class.
  •  The Coupling Between Object classes (CBO) metric indicates the number of noninheritance-related classes to which a class is coupled.
  •  The Lack of COhesion in Methods (LCOM) metric indicates the degree to which the methods of a class do not share the class's attributes.

Software metrics are the features of the model and the defect labels are the class labels. From a Machine Learning perspective, these prediction models can be considered classification models since we have two groups of data instances, namely defect-prone and nondefect-prone. Defect prediction datasets are imbalanced [6], which means that most of the modules (i.e., 85–90%) in these datasets belong to the nondefect-prone class. Therefore, classification algorithms in Machine Learning might fail to detect the minority class data points due to the imbalanced characteristics of the datasets.
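From a data representation point of view, each module thus becomes a row of metric values with a binary label. The following minimal sketch, which assumes a hypothetical PROMISE-style CSV file and hypothetical column names (not the authors' exact data handling), illustrates how such a dataset can be split into a CK feature matrix and a defect label vector:

```python
# Minimal sketch: load a defect dataset and separate CK metric features from
# the binary defect label. File name and column names are hypothetical.
import pandas as pd

CK_FEATURES = ["wmc", "dit", "rfc", "noc", "cbo", "lcom"]  # CK metrics suite

df = pd.read_csv("ant.csv")                  # hypothetical dataset file
X = df[CK_FEATURES].values                   # design-level metric features
y = (df["bug"] > 0).astype(int).values       # 1 = defect-prone, 0 = nondefect-prone

print("defect-prone ratio: %.2f" % y.mean()) # typically around 0.10-0.20
```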

The Machine Learning community has carried out extensive research on imbalanced learning so far [7,8], but the empirical software engineering community has not yet evaluated the impact of these algorithms in as much detail as the Machine Learning researchers. Algorithms for imbalanced learning can be grouped into the following four categories [9]:

  1.  Subsampling [10]: With subsampling, the data distribution is balanced before the classification algorithm is applied. The following four main approaches exist in this category [9]:
      (a)  Under-sampling: A subset of the majority class is selected to balance the distribution of the data points, but this might cause the loss of some useful data.
      (b)  Over-sampling: Random replications of the minority class are created, but this might cause an over-fitting problem [11].
      (c)  SMOTE [10]: This is a very popular over-sampling method which successfully avoids over-fitting by generating new minority data points through interpolation between near neighbors (a usage sketch is given after this list).
      (d)  Hybrid methods [12]: These methods combine several subsampling methods to balance the dataset.
  2.  Cost-sensitive learning [13]: A false negative prediction is more costly than a false positive prediction in software defect prediction. This approach uses a cost matrix which specifies the misclassification cost for each class; this matrix is then taken into account when the model is optimized on the training data [9]. Since misclassification costs are not publicly available, applying these methods requires the knowledge of domain experts.
  3.  Ensemble learning [14]: The generalization capabilities of different classifiers are combined, and the resulting classifier, called an ensemble of classifiers or multiple classifier system (MCS), provides better performance than the individual classifiers used to build it. Bagging [15] and Boosting [16] are among the most popular algorithms in this category. AdaBoost is listed as one of the top ten algorithms in data mining [17,9].
  4.  Imbalanced ensemble learning [11]: These algorithms combine subsampling methods with ensemble learning algorithms. If over-sampling is used in the Bagging approach instead of random sampling, the result is known as OverBagging [18]. There are several approaches which use the same idea [19–21].
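As a concrete illustration of the over-sampling idea, the sketch below applies SMOTE, Borderline SMOTE, and ADASYN (the three samplers used later in this chapter) to a synthetic imbalanced dataset. It assumes the imbalanced-learn library as the implementation; the chapter itself does not prescribe a specific tool.

```python
# Minimal sketch of three over-sampling techniques on a synthetic dataset
# that mimics an imbalanced defect dataset (~15% minority class).
from collections import Counter
from sklearn.datasets import make_classification
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, ADASYN

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=42)
print("original:", Counter(y))

for sampler in (SMOTE(random_state=42),
                BorderlineSMOTE(random_state=42),
                ADASYN(random_state=42)):
    X_res, y_res = sampler.fit_resample(X, y)
    print(type(sampler).__name__, Counter(y_res))  # classes roughly balanced
```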

In this study, we used algorithms from the subsampling (ADASYN, Borderline SMOTE, and SMOTE) and ensemble learning (AdaBoostM1) categories. We did not experiment with algorithms from the cost-sensitive learning category because there is no easy way to identify the misclassification costs for the analysis. In addition to these algorithms, we used further classification algorithms in conjunction with the imbalanced learning algorithms. Since AdaboostM1 is an ensemble learning algorithm and we used it in conjunction with subsampling techniques, we also effectively made use of the imbalanced ensemble learning category.
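A hedged sketch of this combination is given below: a SMOTE sampler chained with an AdaBoost classifier inside an imbalanced-learn Pipeline, so that resampling is applied only to the training folds during cross-validation. The tooling (scikit-learn and imbalanced-learn) and the synthetic data are assumptions for illustration, not the authors' exact setup.

```python
# Sketch of an "imbalanced ensemble" combination: SMOTE + AdaBoost in a
# pipeline evaluated with 3-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

model = Pipeline([("smote", SMOTE(random_state=0)),
                  ("boost", AdaBoostClassifier(n_estimators=50, random_state=0))])

# Resampling happens inside each training fold only; test folds stay imbalanced.
auc = cross_val_score(model, X, y, cv=3, scoring="roc_auc")
print("3-fold AUC: %.3f +/- %.3f" % (auc.mean(), auc.std()))
```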

Compared to the study of Song et al. [9], we applied two additional subsampling approaches, ADASYN and Borderline SMOTE. We also applied two additional classification algorithms, Linear Discriminant and Subspace Discriminant, which were not analyzed in that study.

6.3 Methodology

6.3.1 Classification methods

Six classification methods have been investigated in this study: Random Forest, Adaboost, SVM, Linear Discriminant Analysis (LDA), Subspace Discriminant, and W-kNN. Random Forest, Adaboost, and Subspace Discriminant are ensemble classification methods which combine single classifiers to obtain better predictive performance. The single classifiers used in our experiments are SVM, LDA, and W-kNN; these classifiers are widely used base methods for ensemble classification. A brief description of these ensemble and single Machine Learning algorithms is provided in the following subsections.

6.3.2 Ensemble classifiers

The first ensemble method used in our experiments is Random Forest, a combination of multiple decision trees [22]. Bootstrapping is applied to select the samples for each tree in the forest: roughly two-thirds of the selected data is used to train a tree, and the remaining (out-of-bag) data is used to evaluate it. Majority voting is applied to obtain the final prediction: the Random Forest algorithm counts the votes from all the trees and the majority vote becomes the classification output. Random Forest is easy to use, it is resistant to over-fitting, and the trained ensemble of trees can be stored and reused on other data.
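The following is a minimal, hedged sketch of such a forest using scikit-learn (an assumed implementation; the chapter does not name its tooling), with the out-of-bag estimate standing in for the evaluation on the data left out of each bootstrap sample:

```python
# Minimal Random Forest sketch with an out-of-bag performance estimate.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

forest = RandomForestClassifier(n_estimators=100, bootstrap=True,
                                oob_score=True, random_state=0)
forest.fit(X, y)
print("out-of-bag accuracy: %.3f" % forest.oob_score_)
```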

The second ensemble method is Subspace Discriminant, which uses linear discriminant base classifiers and feature-based bagging. Feature bagging is applied to reduce the correlation between the estimators; the difference compared to standard bagging is that the features, rather than the samples, are randomly subsampled with replacement for each learner. This yields learners that specialize on different feature subsets.
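A hedged approximation of this ensemble, built from linear discriminant learners on random feature subsets with scikit-learn's BaggingClassifier, is sketched below; the subspace size (half of the features) and the ensemble size are illustrative assumptions.

```python
# Sketch of a Subspace Discriminant-style ensemble: LDA base learners, each
# trained on a random subset of features (feature bagging).
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.ensemble import BaggingClassifier

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

subspace_lda = BaggingClassifier(LinearDiscriminantAnalysis(),
                                 n_estimators=30,
                                 max_features=0.5,         # random feature subsets
                                 bootstrap=False,          # keep all samples
                                 bootstrap_features=True,  # sample features with replacement
                                 random_state=0)
subspace_lda.fit(X, y)
print("training accuracy: %.3f" % subspace_lda.score(X, y))
```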

The final ensemble method is Adaboost. The AdaBoost method, proposed by Freund and Schapire [23], uses boosting to obtain a strong classifier by combining weak classifiers. In boosting, the training data are split into parts; a predictive model is trained on some of these parts and tested on one of the remaining parts. A second model is then trained by giving more weight to the samples misclassified by the first model. This process is repeated: each new model is trained with emphasis on the samples misclassified by the previous models.
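A minimal AdaBoost sketch with decision stumps as the weak learners is shown below; scikit-learn's AdaBoostClassifier is assumed as the implementation of AdaBoostM1-style boosting.

```python
# Minimal AdaBoost sketch: shallow trees ("weak" learners) are added one after
# another, each fitted with more weight on previously misclassified samples.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

boost = AdaBoostClassifier(DecisionTreeClassifier(max_depth=1),  # decision stumps
                           n_estimators=100, learning_rate=0.5, random_state=0)
boost.fit(X, y)
print("training accuracy: %.3f" % boost.score(X, y))
```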

6.3.3 Single classifiers

The first and most well-known single classifier used in our experiments is SVM, a widely used method in Machine Learning tasks [24]. In this method, the classification is performed using linear or nonlinear kernels. The SVM method aims to find the hyperplane that separates the data points in the training set with the largest margin.
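As a small illustration (again assuming scikit-learn and synthetic data), a linear-kernel SVM can be fitted as follows; the support vectors are the training points that define the maximum-margin hyperplane.

```python
# Linear SVM sketch: fit a maximum-margin separating hyperplane.
from sklearn.datasets import make_classification
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

svm = SVC(kernel="linear", C=1.0)
svm.fit(X, y)
print("support vectors per class:", svm.n_support_)
```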

The second single classifier is LDA. LDA projects a dataset into a lower-dimensional feature space to increase the separability between classes [25]. The first step of LDA is the computation of the between-class variance to measure the separability. The second step is to compute the within-class variance, i.e., the distance between the samples of each class and their class mean. The final step is to construct the lower-dimensional space that maximizes the interclass (between-class) variance and minimizes the intraclass (within-class) variance.
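A minimal sketch of LDA on synthetic data is given below; with two classes, the learned projection has a single discriminant axis.

```python
# LDA sketch: supervised projection onto (n_classes - 1) discriminant axes,
# followed by classification in that space.
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

lda = LinearDiscriminantAnalysis()
X_proj = lda.fit_transform(X, y)      # shape (1000, 1) for two classes
print("projected shape:", X_proj.shape)
print("training accuracy: %.3f" % lda.score(X, y))
```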

Our last single classifier is W-kNN, an extension of the k-Nearest Neighbor (kNN) algorithm. In standard kNN, all neighbors have the same influence even though their similarities to the query point differ. In W-kNN, training samples that are close to the new observation receive larger weights than distant ones [26].
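A hedged sketch of W-kNN, approximated here with scikit-learn's distance-weighted kNN on synthetic data:

```python
# Weighted kNN sketch: weights="distance" gives closer neighbors a larger vote
# than distant ones, unlike the uniform weighting of standard kNN.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

wknn = KNeighborsClassifier(n_neighbors=10, weights="distance")
scores = cross_val_score(wknn, X, y, cv=3, scoring="roc_auc")
print("3-fold AUC: %.3f" % scores.mean())
```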

6.4 Experimental results

During the experiments, we applied three data sampling methods and six classification methods to ten publicly available datasets. Three-fold cross-validation was applied for the evaluation of the approaches, and each experiment was repeated fifty times. The experimental results are compared based on five evaluation metrics (AUC, Recall, Precision, F-score, and Specificity). The features are grouped into two sets: the CK feature set and the ALL feature set, which contains all features present in the datasets. The selected datasets are ant, arc, ivy, log4j, poi, prop, redactor, synapse, xalan, and xerces.
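The evaluation protocol can be sketched as repeated stratified cross-validation with the five metrics above. The snippet below is a simplified, assumption-laden illustration (scikit-learn scorers, a synthetic dataset, and fewer repeats than the fifty used in the study); specificity is obtained as recall on the negative class.

```python
# Sketch of the evaluation protocol: repeated 3-fold cross-validation reporting
# AUC, Recall, Precision, F-score, and Specificity.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

X, y = make_classification(n_samples=1000, n_features=6, weights=[0.85, 0.15],
                           random_state=0)

scoring = {"auc": "roc_auc",
           "recall": "recall",
           "precision": "precision",
           "f_score": "f1",
           "specificity": make_scorer(recall_score, pos_label=0)}

cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=5, random_state=0)  # 50 in the study
results = cross_validate(RandomForestClassifier(random_state=0), X, y,
                         cv=cv, scoring=scoring)

for name in scoring:
    print("%-12s %.4f" % (name, results["test_" + name].mean()))
```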

6.4.1 Experiments based on CK metrics

During these experiments, the CK features were used and the results were evaluated on this feature set. The average scores of the classification methods over all datasets are presented in Fig. 6.1 and Table 6.1. Random Forest and Adaboost are the most successful classifiers based on the AUC metric (0.7396 and 0.7269, respectively). The Adaboost classifier provides the best performance in identifying the positive samples (i.e., the minority class) (recall: 0.6083); however, its performance is the worst for the identification of the negative samples (i.e., the majority class) (specificity: 0.6903). Linear Discriminant, Subspace Discriminant, and Linear SVM exhibit similar performance and are better at identifying the negative samples.

Figure 6.1 Average scores of the classifiers trained and tested with CK features.

Table 6.1

Average scores of the classifiers trained and tested with CK features.

Classifiers | Average of AUC | Average of recall | Average of precision | Average of F-score | Average of specificity
AdaBoostM1 | 0.7269 | 0.6083 | 0.5265 | 0.5429 | 0.6903
Linear Discriminant | 0.7098 | 0.5435 | 0.5408 | 0.5019 | 0.7446
Linear SVM | 0.6943 | 0.5397 | 0.5148 | 0.4908 | 0.7465
Random Forest | 0.7396 | 0.5658 | 0.5614 | 0.5576 | 0.7381
Subs Discriminant | 0.7157 | 0.5414 | 0.5458 | 0.5026 | 0.7495
Weighted-kNN | 0.7057 | 0.5888 | 0.5252 | 0.5301 | 0.7062


The average scores of the classification methods grouped by sampling method over all datasets are presented in Fig. 6.2 and Table 6.2. The AUC and Specificity performances of the classifiers on balanced and unbalanced data are similar, differing by only 1–2%. In contrast, the sampling methods improve the Recall metric by 13%. The effect of the sampling methods is most visible in the identification of the positive samples, i.e., the minority class. Based on the AUC metric, SMOTE is the most successful sampling method on the datasets including CK features.

Figure 6.2 Average scores of the sampling methods with CK features.

Table 6.2

Average scores of the sampling methods with CK features.

Sampling method | Average of AUC | Average of recall | Average of precision | Average of F-score | Average of specificity
ADASYN | 0.7232 | 0.6065 | 0.5282 | 0.5342 | 0.7286
BL SMOTE | 0.7121 | 0.595 | 0.5146 | 0.5208 | 0.7138
SMOTE | 0.7247 | 0.5933 | 0.5374 | 0.5375 | 0.7412
Unbalanced | 0.7013 | 0.4634 | 0.5628 | 0.4914 | 0.7332


6.4.2 Experiments based on ALL metrics

In these experiments, all of the features were used, and the results are presented according to this feature set [27]. The features and datasets can be accessed from the SEACRAFT repository at http://tiny.cc/seacraft. Jureczko and Madeyski [27] used the CKJM tool to collect 19 metrics, i.e., the CK metrics suite, the QMOOD metrics suite, Tang, Kao, and Chen's metrics, and the cyclomatic complexity, LCOM3, Ca, Ce, and LOC metrics. They collected the defect data using the BugInfo tool.

The average scores of the classification methods over all datasets are presented in Fig. 6.3 and Table 6.3. Random Forest and Adaboost are the most successful classifiers based on the AUC metric (0.7558 and 0.7363, respectively), as in the experiments with CK features. The Linear Discriminant and Adaboost classifiers provide the best performance in identifying the positive samples (recall: 0.5530 and 0.5430, respectively). In contrast to the CK experiments, the Adaboost classifier is one of the best performing classifiers at identifying the negative samples (specificity: 0.7576), while Random Forest is the best classifier for the identification of the majority class (specificity: 0.7692). The performance of Linear SVM (AUC: 0.7161) and W-kNN (AUC: 0.6776) is lower than that of the other algorithms.

Figure 6.3 Average scores of the classifiers trained and tested with ALL features.

Table 6.3

Average scores of the classifiers trained and tested with ALL features.

Classifier | Average of AUC | Average of recall | Average of precision | Average of F-score | Average of specificity
AdaBoostM1 | 0.7363 | 0.5430 | 0.5608 | 0.5256 | 0.7576
Linear Discriminant | 0.6994 | 0.5530 | 0.5809 | 0.5276 | 0.7497
Linear SVM | 0.7161 | 0.5309 | 0.5469 | 0.4892 | 0.7150
Random Forest | 0.7558 | 0.5325 | 0.6116 | 0.5431 | 0.7692
Subs Discriminant | 0.7298 | 0.5420 | 0.5946 | 0.5154 | 0.7262
Weighted-kNN | 0.6776 | 0.5208 | 0.5138 | 0.4757 | 0.7223


The average scores of the classification methods grouped by sampling method over all datasets are presented in Fig. 6.4 and Table 6.4. The AUC and Specificity performances of the classifiers on balanced (ADASYN and BL SMOTE) and unbalanced data are similar, differing by only 1–3%. The ADASYN and BL SMOTE sampling methods improve the recall metric by 14%; however, this effect is reversed for the SMOTE method: the average AUC value drops from 0.7445 to 0.6357 when SMOTE is used. The precision score is highest for the classifiers on unbalanced data. The effect of the sampling methods is most visible in the identification of the positive samples, which belong to the minority class. In contrast to the CK experiments, SMOTE is the worst performing sampling method on the datasets including ALL features, and the recall value drops from 0.5025 to 0.3613 when SMOTE is applied. Therefore, SMOTE does not help to identify the instances belonging to the minority class when all the features are used.

Figure 6.4 Average scores of the sampling methods with ALL features.

Table 6.4

Average scores of the sampling methods with ALL features.

Sampling method | Average of AUC | Average of recall | Average of precision | Average of F-score | Average of specificity
ADASYN | 0.7499 | 0.643 | 0.5514 | 0.5728 | 0.7296
BL SMOTE | 0.7465 | 0.6414 | 0.5443 | 0.5667 | 0.7233
SMOTE | 0.6357 | 0.3613 | 0.5418 | 0.3663 | 0.7572
Unbalanced | 0.7445 | 0.5025 | 0.6349 | 0.5452 | 0.7499


6.4.3 Comparison of the features

In this section, the experiments with CK and ALL features are compared. Using the values from Table 6.1 and Table 6.3, the comparison of the average outcomes of the classification methods is presented in Fig. 6.5. In this figure, negative values denote the superiority of ALL features, while positive values denote the superiority of CK features. While the CK feature set is more successful at identifying the positive samples, the ALL feature set is better for the negative samples. The difference with respect to AUC and F-score values is small. In addition, classifiers trained with ALL features provide better precision scores.

Figure 6.5 Difference between outcomes of the classifiers with CK and ALL features. Negative values denote the superiority of ALL features, positive values denote the superiority of CK features.

Using the values from Table 6.2 and Table 6.4, the comparison of the average outcomes of the sampling methods is presented in Fig. 6.6. In this figure, negative values denote the superiority of ALL features, while positive values denote the superiority of CK features. While the classifiers trained on data balanced with the SMOTE method provide better AUC, F-score, and recall outcomes with CK features, the other sampling methods are more beneficial with ALL features and provide better outcomes for all scores.

Figure 6.6 Difference between outcomes of the sampling methods with CK and ALL features. While negative values denote the superiority of ALL features, positive values denote the superiority of CK features.

In Fig. 6.7, the difference between the outcomes of the classifiers with CK and ALL features is presented for the unbalanced datasets. In this figure, negative values denote the superiority of ALL features and positive values denote the superiority of CK features. The outcomes of the classifiers with ALL features are better, except for the W-kNN classifier. ALL features are therefore preferable to CK features when dealing with unbalanced datasets.

Figure 6.7 The difference between outcomes of the classifiers with CK and ALL features on the unbalanced datasets. Negative values denote the superiority of ALL features, positive values denote the superiority of CK features.

In Table 6.5, the statistical significance results obtained by McNemar's test are presented. All the classifier/sampling method combinations are compared as CK vs. ALL features for all datasets. In the table, classification methods are grouped under each data sampling method and P-values are shown for each dataset; values below 0.05 indicate a statistically significant difference between the two feature sets.
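To make the comparison concrete, the sketch below shows how such a McNemar test can be computed for one classifier on one dataset, comparing predictions obtained with a CK-style feature subset against predictions obtained with all features. The synthetic data, the choice of scikit-learn and statsmodels, and the simple holdout split are assumptions for illustration, not the authors' exact procedure.

```python
# McNemar's test sketch: compare the correctness of two models (CK-only
# features vs. all features) on the same test samples.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from statsmodels.stats.contingency_tables import mcnemar

X, y = make_classification(n_samples=1000, n_features=20, weights=[0.85, 0.15],
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.33, random_state=0,
                                          stratify=y)

# "CK" model uses the first 6 columns only; "ALL" model uses every column.
pred_ck = RandomForestClassifier(random_state=0).fit(X_tr[:, :6], y_tr).predict(X_te[:, :6])
pred_all = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).predict(X_te)

ok_ck, ok_all = pred_ck == y_te, pred_all == y_te
table = [[np.sum(ok_ck & ok_all),  np.sum(ok_ck & ~ok_all)],
         [np.sum(~ok_ck & ok_all), np.sum(~ok_ck & ~ok_all)]]

result = mcnemar(table, exact=True)  # exact binomial test on the discordant pairs
print("McNemar p-value: %.4f" % result.pvalue)
```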

Table 6.5

Statistical significance analysis results (P-values from McNemar's test) for the comparison of CK and ALL features. Classification methods are grouped under each data sampling method, and P-values are given for each dataset; values below 0.05 are statistically significant.

Methods | ant | arc | ivy | log4j | poi | prop | red. | syn. | xalan | xerces

ADASYN
AdaBoostM1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.30 | 0.00 | 0.00 | 0.00 | 0.00 | 0.02
Linear Discriminant | 0.00 | 0.10 | 0.79 | 0.00 | 0.00 | 0.52 | 0.00 | 0.12 | 0.00 | 0.00
Linear SVM | 0.55 | 0.01 | 0.05 | 0.00 | 0.02 | 0.04 | 0.00 | 0.01 | 0.00 | 0.04
Random Forest | 0.17 | 0.12 | 0.00 | 0.01 | 0.52 | 0.01 | 0.19 | 0.66 | 0.32 | 0.87
Subs Discriminant | 0.00 | 0.07 | 0.47 | 0.00 | 0.00 | 0.14 | 0.00 | 0.63 | 0.00 | 0.20
Weighted-kNN | 0.08 | 1.00 | 0.52 | 0.37 | 0.68 | 0.92 | 0.26 | 0.00 | 0.00 | 0.39

BL SMOTE
AdaBoostM1 | 0.00 | 0.01 | 0.00 | 0.00 | 0.23 | 0.00 | 0.00 | 0.23 | 0.01 | 0.64
Linear Discriminant | 0.00 | 0.03 | 1.00 | 0.00 | 0.00 | 0.88 | 0.00 | 0.90 | 0.00 | 0.00
Linear SVM | 0.02 | 0.02 | 0.49 | 0.02 | 0.00 | 0.46 | 0.00 | 0.20 | 0.00 | 0.71
Random Forest | 1.00 | 0.03 | 0.26 | 0.17 | 0.72 | 0.73 | 0.00 | 0.05 | 0.03 | 0.40
Subs Discriminant | 0.00 | 0.58 | 0.65 | 0.00 | 0.00 | 0.49 | 0.00 | 0.15 | 0.00 | 0.87
Weighted-kNN | 0.70 | 0.30 | 0.28 | 0.44 | 0.68 | 0.06 | 0.88 | 0.82 | 0.00 | 0.48

SMOTE
AdaBoostM1 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Linear Discriminant | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.23 | 0.00
Linear SVM | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Random Forest | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Subs Discriminant | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00
Weighted-kNN | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00 | 0.00

Unbalanced
AdaBoostM1 | 0.01 | 0.18 | 0.43 | 0.37 | 0.39 | 0.62 | 0.06 | 0.37 | 0.41 | 0.32
Linear Discriminant | 0.85 | 1.00 | 0.82 | 0.18 | 0.00 | 0.03 | 0.00 | 0.03 | 0.16 | 0.00
Linear SVM | 0.17 | 0.16 | 0.00 | 0.45 | 0.00 | 0.55 | 0.00 | 0.49 | 0.60 | 0.35
Random Forest | 1.00 | 0.05 | 0.32 | 0.26 | 0.90 | 0.00 | 0.80 | 0.63 | 0.45 | 0.47
Subs Discriminant | 0.03 | 0.21 | 0.10 | 0.32 | 0.00 | 0.53 | 0.01 | 1.00 | 0.44 | 0.00
Weighted-kNN | 0.92 | 0.78 | 0.68 | 0.10 | 0.01 | 0.42 | 0.78 | 0.58 | 0.56 | 0.06


Based on Table 6.5, Fig. 6.5, and Fig. 6.6, we conclude that the SMOTE sampling method yields statistically significant differences in nearly all of our experiments, and this significance shows the superiority of the data with CK features when it is balanced using the SMOTE method. For the other sampling methods and the unbalanced data, significant instances are also present, but there are some insignificant ones as well. ALL features provide better outcomes on the unbalanced data and on data balanced with the sampling methods other than SMOTE.

6.5 Discussion

In this study, we performed several experiments to evaluate the effect of design-level metrics and data sampling methods on the performance of defect prediction models. Research questions of this study were introduced in Section 6.1, and in this section we present our responses to each of them.

  •  RQ1: Which sampling techniques are more effective to improve the performance of defect prediction models?
     The answer to RQ1: Regarding the CK metrics suite, the SMOTE data sampling method is the best one; for the ALL metrics suite, the other data sampling methods are preferable.
  •  RQ2: Which classifiers are more effective in predicting software defects when sampling techniques are applied?
     The answer to RQ2: Regarding the CK metrics, the Random Forest algorithm is the most effective one with respect to the AUC value. Adaboost is the second best performing algorithm and it is also the best algorithm for identifying the minority class instances.
     Regarding the ALL metrics set, again the Random Forest algorithm provides the best performance and Adaboost is the second best performing one. Adaboost and Linear Discriminant are the best performing algorithms for the identification of minority class instances, and Adaboost is among the best classifiers for the detection of majority class instances.
  •  RQ3: Are design-level metrics (CK metrics suite) suitable to build defect prediction models when sampling techniques are applied?
     The answer to RQ3: We showed that the CK feature set is more successful for the identification of minority class instances and the ALL feature set is better for the identification of majority class instances. With the CK metrics suite, the best performance was achieved by the Random Forest algorithm (AUC: 0.7396). When additional features are added to the CK metrics suite, the AUC of the Random Forest-based model improves from 0.7396 to 0.7558.

Potential threats to validity for this study are evaluated along four dimensions, namely conclusion validity, construct validity, internal validity, and external validity [28]. Regarding conclusion validity, we applied three-fold cross-validation repeated 50 times (N * M cross-validation, N = 3, M = 50) to obtain statistically meaningful results and to reduce the effect of randomness during the experiments. In addition to this validation technique, we performed McNemar's test to evaluate the statistical significance of our results. Five evaluation parameters were used to report the experimental results and ten public datasets were investigated. For the construct validity dimension, we used ten public datasets that are widely applied to the software defect prediction problem. Regarding internal validity, we investigated three single classifiers, three ensemble techniques, and three data sampling techniques to evaluate the impact of classifiers, metrics suites, and data sampling techniques on software defect prediction. While we selected well-performing algorithms, other researchers might analyze other techniques with different parameters and therefore obtain different results. For the external validity dimension, we must stress that our conclusions are valid for the datasets described in this chapter and results might differ on a different set of datasets.

6.6 Conclusion

Software defect prediction is an active research problem [29,9,30–33]. Hundreds of papers on this problem have been published so far [32]. In this study, we focused on the impact of data balancing methods, ensemble methods, and single classification algorithms on the software defect prediction problem. We empirically showed that design-level metrics can be used to build defect prediction models. In addition, we demonstrated that the SMOTE data sampling approach can improve the performance of prediction models when design-level metrics (the CK metrics suite) are applied. Among the ensemble methods, we observed that the Adaboost ensemble method is the best one for identifying the minority class (defect-prone) samples when the CK metrics suite is adopted. When all metrics in the datasets are applied, the Adaboost algorithm provides one of the best performances for identifying the majority class instances. In the near future, we plan to perform more experiments as we gain access to additional data balancing methods and datasets.

References

[1] C. Catal, B. Diri, A systematic review of software fault prediction studies, Expert Systems with Applications 2009;36(4):7346–7354.

[2] C. Catal, U. Sevim, B. Diri, Metrics-driven software quality prediction without prior fault data, Electronic Engineering and Computing Technology. Springer; 2010:189–199.

[3] C. Catal, A comparison of semi-supervised classification approaches for software defect prediction, Journal of Intelligent Systems 2014;23(1):75–82.

[4] T. Hall, S. Beecham, D. Bowes, D. Gray, S. Counsell, A systematic literature review on fault prediction performance in software engineering, IEEE Transactions on Software Engineering 2012;38(6):1276–1304.

[5] M. Sokolova, N. Japkowicz, S. Szpakowicz, Beyond accuracy, F-score and ROC: a family of discriminant measures for performance evaluation, Australian Conference on Artificial Intelligence. 2006.

[6] S. Wang, X. Yao, Using class imbalance learning for software defect prediction, IEEE Transactions on Reliability 2013;62(2):434–443 10.1109/TR.2013.2259203.

[7] H. He, E.A. Garcia, Learning from imbalanced data, IEEE Transactions on Knowledge and Data Engineering 2008;9:1263–1284.

[8] H. He, Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications. John Wiley & Sons; 2013.

[9] Q. Song, Y. Guo, M. Shepperd, A comprehensive investigation of the role of imbalanced learning for software defect prediction, IEEE Transactions on Software Engineering 2018 10.1109/TSE.2018.2836442 in press.

[10] N.V. Chawla, K.W. Bowyer, L.O. Hall, W.P. Kegelmeyer, Smote: synthetic minority over-sampling technique, Journal of Artificial Intelligence Research 2002;16:321–357.

[11] M. Galar, A. Fernandez, E. Barrenechea, H. Bustince, F. Herrera, A review on ensembles for the class imbalance problem: bagging-, boosting-, and hybrid-based approaches, IEEE Transactions on Systems, Man and Cybernetics. Part C, Applications and Reviews 2012;42(4):463–484.

[12] E. Ramentol, Y. Caballero, R. Bello, F. Herrera, SMOTE-RSB*: a hybrid preprocessing approach based on oversampling and undersampling for high imbalanced data-sets using SMOTE and rough sets theory, Knowledge and Information Systems 2012;33(2):245–265.

[13] C. Elkan, The foundations of cost-sensitive learning, International Joint Conference on Artificial Intelligence. Lawrence Erlbaum Associates Ltd; 2001;vol. 17:973–978.

[14] C. Zhang, Y. Ma, Ensemble Machine Learning: Methods and Applications. Springer; 2012.

[15] L. Breiman, Bagging predictors, Machine Learning 1996;24(2):123–140.

[16] Y. Freund, R.E. Schapire, A decision-theoretic generalization of on-line learning and an application to boosting, Journal of Computer and System Sciences 1997;55(1):119–139.

[17] X. Wu, V. Kumar, J.R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G.J. McLachlan, A. Ng, B. Liu, S.Y. Philip, et al., Top 10 algorithms in data mining, Knowledge and Information Systems 2008;14(1):1–37.

[18] R. Barandela, R.M. Valdovinos, J.S. Sánchez, New applications of ensembles of classifiers, Pattern Analysis & Applications 2003;6(3):245–256.

[19] S. Wang, X. Yao, Diversity analysis on imbalanced data sets by using ensemble models, Computational Intelligence and Data Mining, 2009. CIDM'09. IEEE Symposium on. IEEE; 2009:324–331.

[20] C. Seiffert, T.M. Khoshgoftaar, J. Van Hulse, A. Napolitano, Rusboost: a hybrid approach to alleviating class imbalance, IEEE Transactions on Systems, Man and Cybernetics. Part A. Systems and Humans 2010;40(1):185–197.

[21] N.V. Chawla, A. Lazarevic, L.O. Hall, K.W. Bowyer, Smoteboost: improving prediction of the minority class in boosting, European Conference on Principles of Data Mining and Knowledge Discovery. Springer; 2003:107–119.

[22] T.K. Ho, The random subspace method for constructing decision forests, IEEE Transactions on Pattern Analysis and Machine Intelligence 1998;20(8):832–844.

[23] Y. Freund, R. Schapire, N. Abe, A short introduction to boosting, Journal of Japanese Society for Artificial Intelligence 1999;14(5):771–780.

[24] J. Friedman, T. Hastie, R. Tibshirani, The Elements of Statistical Learning, vol. 1. Springer Series in Statistics. New York, NY, USA: Springer New York Inc.; 2001.

[25] R.A. Fisher, The use of multiple measurements in taxonomic problems, Annals of Eugenics 1936;7(2):179–188.

[26] K. Hechenbichler, K. Schliep, Weighted k-Nearest-Neighbor Techniques and Ordinal Classification. [Discussion Paper 399, SFB 386] Ludwig-Maximilians University Munich; 2004. URL http://nbn-resolving.de/urn/resolver.pl?urn=nbn:de:bvb:19-epub-1769-9.

[27] M. Jureczko, L. Madeyski, Towards identifying software project clusters with regard to defect prediction, Proceedings of the 6th International Conference on Predictive Models in Software Engineering. PROMISE '10. New York, NY, USA: ACM; 2010, 9 10.1145/1868328.1868342. URL http://doi.acm.org/10.1145/1868328.1868342.

[28] C. Wohlin, P. Runeson, M. Höst, M.C. Ohlsson, B. Regnell, A. Wesslén, Experimentation in Software Engineering: An Introduction, available online: http://books.google.com/books.

[29] J. Nam, W. Fu, S. Kim, T. Menzies, L. Tan, Heterogeneous defect prediction, IEEE Transactions on Software Engineering.

[30] K.E. Bennin, J. Keung, P. Phannachitta, A. Monden, S. Mensah, Mahakil: diversity based oversampling approach to alleviate the class imbalance issue in software defect prediction, IEEE Transactions on Software Engineering 2018;44(6):534–550.

[31] A. Agrawal, T. Menzies, Is better data better than better data miners?: on the benefits of tuning smote for defect prediction, Proceedings of the 40th International Conference on Software Engineering. ACM; 2018:1050–1061.

[32] D. Bowes, T. Hall, J. Petrić, Software defect prediction: do different classifiers find the same defects? Software Quality Journal 2018;26(2):525–552.

[33] X. Yu, M. Wu, Y. Jian, K.E. Bennin, M. Fu, C. Ma, Cross-company defect prediction via semi-supervised clustering-based data filtering and MSTrA-based transfer learning, Soft Computing 2018;22(10):3461–3472.
