Chapter 14: Interpretable semisupervised classifier for predicting cancer stages

Isel Grau(a); Dipankar Sengupta(a,b); Ann Nowé(a)
a Artificial Intelligence Lab, Free University of Brussels (VUB), Brussels, Belgium
b PGJCCR, Queen's University Belfast, Belfast, United Kingdom

Abstract

Machine learning techniques in medicine have been at the forefront of addressing challenges such as diagnosis, prognosis prediction, or precision medicine. In this field, the data are sometimes abundant but come from different sources or lack assigned labels. Manually labeling these data when compiling a curated dataset for supervised classification can be costly. Semisupervised classification offers a wide range of methods for leveraging unlabeled data when learning prediction models. However, these classifiers are commonly deep or ensemble learning structures that often result in black boxes. The requirement of interpretable models for medical settings led us to propose the self-labeling gray box classifier, which outperforms other semisupervised classifiers on benchmarking datasets while providing interpretability. In this chapter, we illustrate the application of the self-labeling gray box on omics and clinical datasets from the cancer genome atlas. We show that the self-labeling gray box is accurate in predicting cancer stages of rare cancers by leveraging unlabeled instances from more common cancer types. We discuss insights, the features influencing prediction, and a global representation of the knowledge through decision trees or rule lists, which can aid clinicians and researchers.

Keywords

Cancer stage prediction; Explainable artificial intelligence; Semisupervised classifier; Self-labeling; Gray box model

Acknowledgments

This work was supported by the Flemish Government (AI Research Program); the IMAGica project, financed by the Interdisciplinary Research Programs and Platforms (IRP) funds of the Vrije Universiteit Brussel; and the BRIGHT analysis project, funded by the European Regional Development Fund (ERDF) and the Brussels-Capital Region as part of the 2014–20 operational program through the F11-08 project ICITY-RDI.BRU (icity.brussels).

1: Introduction

Cancer is a disease, or group of diseases, caused by the transformation of normal cells into tumor cells characterized by uncontrolled growth. This is a multistage process, triggered and regulated by complex and heterogeneous biological causes (Hausman, 2019). In the process, there is a gradual invasion and destruction of healthy cells, tissues, and organs by the cancerous cells (Hausman, 2019). Therefore, a key factor in the diagnosis of cancer is identifying the extent to which it has spread across the body: its stage and TNM classification (tumor, node, metastasis) (Gress et al., 2017). This is also important for treatment planning and patient prognosis. Clinically, the cancer stage describes the size of the tumor and how far it has spread in the body, whereas the grade of a cancer describes its growth rate, i.e., how rapidly it is spreading. An initial clinical staging is usually made based on laboratory (blood, histology, risk factors) and imaging (X-ray, CT scans, MRI) tests, while a more accurate pathological staging is typically performed postsurgery or via biopsy.

Clinical advancements including computational approaches based on machine learning have been developed since the 1980s, which can be used for cancer detection, classification, diagnosis, and prognosis (Cruz and Wishart, 2006; Kourou et al., 2015). With the advancement of omics-based technologies and the availability of omics data (e.g., genome, exome, proteome) along with clinical data, these methods have improved considerably (Zhang et al., 2015; Zhu et al., 2020). However, these developments have mostly targeted common cancer types (colon, breast, prostate, etc.), such as the prostate pathological stage predictor based on biopsy patterns, PSA (prostate-specific antigen) level, and other clinical factors (Cosma et al., 2016). Similarly, there are staging predictors available for breast and colon cancer based on clinical factors (Said et al., 2018; Taniguchi et al., 2019). There are more than 200 types of cancer developing from different types of cells in the body, with lung cancer being the most common (11.6% of total cases, 18.4% of total cancer-related deaths), followed by breast, colorectal, and prostate cancer (Bray et al., 2018; WHO, 2020). However, 27% of the cancer types, like bladder cancer and melanoma, are less common, whereas 20% of them, like thyroid cancer and acute lymphoblastic leukemia, are rare or very rare (Macmillan Cancer Support, 2020; Cancer Research UK, 2020). A major challenge with such rare cancer types is the availability of data, as they have an incidence rate of around 6 per 100,000. The prediction performance of machine learning approaches for classification, diagnosis, or prognosis involving rare cancers is thus limited by the lack of labeled data.

Semisupervised classification (SSC) constitutes an alternative approach for building prediction models in settings where labeled data are limited. The general aim of SSC is to improve the generalization ability of the predictor compared to learning a supervised model from the labeled data alone. The main assumption of SSC is that the underlying marginal distribution of instances over the feature space provides information on the joint distribution of instances and their class label, from which the labeled instances were sampled. When this condition is met, it is possible to use the unlabeled data for gaining information about the distribution of instances and therefore also the joint distribution of instances and labels (Zhu et al., 2020). SSC methods available in the literature are based on different views of this assumption. For example, graph-based methods (Blum and Chawla, 2001) assume label smoothness in clusters of instances, i.e., two similar instances will share their label; therefore, an unlabeled instance can take the label of its neighbors and propagate this label to other neighboring instances. Semisupervised support vector machines (Joachims, 1999) assume that decision boundaries should be placed in low-density areas, which is complementary to the cluster view described earlier. In this method, unlabeled instances help compute better margins for placing the boundaries. Generative mixture models (Goldberg et al., 2009) try to find a mixture of distributions (e.g., Gaussian distributions), where each distribution represents a class label. They learn the joint probability by assuming a type of distribution and adjusting its parameters using information from the labeled and the unlabeled data together. Finally, self-labeling methods use an ensemble of classifiers trained on the available labeled data for assigning labels to the unlabeled instances, assuming their classifications are correct. This assumption makes self-labeling the simplest and most versatile family of semisupervised classifiers, since it can be used with practically any base supervised classifier (Van Engelen and Hoos, 2020). Although SSC methods achieve very attractive performance in terms of accuracy in a wide variety of problems (Triguero et al., 2015), they often result in complex structures which act as black boxes in terms of interpretability.

Nowadays, an increasing requirement in the application of machine learning is to obtain not only accurate models but also interpretable ones. Interpretability is a fundamental tool for gaining insights into how an algorithm produces a particular outcome and for attaining the trust of end users. Although several formalizations exist (Barredo Arrieta et al., 2020; Doshi-Velez and Kim, 2017; Lipton, 2016), interpretability is directly connected to the transparency of the machine learning models obtained. The transparency spectrum (see Fig. 1) starts from completely black box models, which involve deep or ensemble structures that cannot be decomposed and mapped to the problem domain. At the opposite extreme are white box models, which are built from the laws and principles of the problem domain. Since pure white boxes rarely exist (Nelles, 2001), this end of the spectrum also includes models that are built from data but whose structure allows for interpretation.

Fig. 1
Fig. 1 Fictional plot representing the trade-off between accuracy and interpretability for the best-known machine learning families of models. The transparency spectrum is depicted along the x axis. Inspired by a similar figure published by Barredo Arrieta et al. (2020).

White box techniques are commonly referred to as intrinsically interpretable and vary in the types of interpretations they can provide as well as in their limitations for prediction. Examples of intrinsically interpretable methods are linear and logistic regression (Hastie et al., 2008), k-nearest neighbors, naïve Bayes (Altman, 1992), decision trees (Quinlan, 1993), and decision lists (Cohen, 1995; Frank and Witten, 1998). On the opposite side, black boxes are normally more accurate techniques that learn exclusively from data, but they are not easily understandable at a global level. As a solution, several model-agnostic post hoc methods exist for generating explanations which quantify the feature attribution in the prediction of certain outcomes according to the black box. Some examples of these techniques are dependency plots (Friedman, 2001), feature importance metrics (Breiman, 2001), local surrogates (LIME) (Ribeiro et al., 2016), and Shapley values (Lundberg and Lee, 2017; Shapley, 1953). While explanations provided by intrinsically interpretable models are derived from their structure and easily mappable to the problem domain, model-agnostic ones are often local or limited to feature attribution rather than providing a holistic view of the model.

Global surrogates or gray box models take the best of both worlds while trying to find a suitable trade-off between accuracy and interpretability. The idea behind this technique is to distill the knowledge of a previously trained black box model in an intrinsically interpretable one. In this way, the prediction capabilities are kept to some extent by the black box component, while the white box learns to mimic these predictions through a more transparent structure. In our earlier work (Grau et al., 2018, 2020a), we proposed a gray box model for SSC settings, called self-labeling gray box (SLGB). Our method uses the self-labeling strategy from SSC for assigning a label to unlabeled instances. This part of the learning process is carried out by the more accurate black box component. Once all instances have been self-labeled, then a white box model is learned from the enlarged dataset. The white box component, being an intrinsically interpretable classifier, allows for interpretable representation of the model at a global level as well as individual explanations for the prediction of instances. The SLGB outperforms several state-of-the-art SSC algorithms in a wide benchmark of structured classification problems where labeled instances are scarce (Grau et al., 2020b).

In this chapter, we illustrate the application of self-labeling gray box models on the proteomic (reverse phase protein array) (Li et al., 2013) and clinical datasets (Liu et al., 2018; Weinstein et al., 2013) for breast (common), esophageal (less common), and thyroid (rare) cancer. In comparison to other omics datasets, we consider proteomics data for this study, as the activity of a protein is a more relevant phenotype during pathogenesis than its expression (Lim, 2005). The target feature to predict is the cancer stage of the patient. We first study how the inclusion of features from both dimensions (clinical and proteomics) influences the prediction performance. Second, we test how accurate the SLGB classifier is when leveraging unlabeled data for predicting cancer stages. Third, we test how adding unlabeled data from more frequent types of cancer helps in the stage prediction of less common or rare cancer types. Throughout the experiments section, we use our interpretable semisupervised classifier to illustrate why certain cancer stages are predicted and which information is important for the predictions.

The rest of this chapter is structured as follows. Section 2 describes the SLGB approach with details on its components and learning algorithm. Section 3 describes the preprocessing steps carried out for compiling the datasets used in the analysis. Section 4 discusses the experimental results in different settings, covering both the performance and interpretability angles. Section 5 formalizes the concluding remarks and research directions to be explored in the future.

2: Self-labeling gray box

In supervised classification, data points or instances x ∈ X are described by a set of attributes or features A and a decision label y ∈ Y. A function f : X → Y is learned from data by relying on pairs of previously labeled examples (x, y). Later, the function f can be used for predicting the label of unseen instances.

When the labeled pairs (x, y) are limited, SSC uses both labeled and unlabeled instances in the learning process with the aim of improving the generalization ability. In an SSC setting, a set L ⊂ X denotes the instances which are associated with their respective class labels in Y and a set U ⊂ X represents the unlabeled instances, where usually |L| < |U|. A semisupervised classifier will try to learn a function g : L ∪ U → Y for predicting the class label of any instance, leveraging both labeled and unlabeled data.

Self-labeling is a family of SSC methods which uses one or more base classifiers for learning a supervised model that later predicts the unlabeled data, assuming the first predictions are correct to some extent. In the self-labeling process, instances can be added to the enlarged dataset incrementally or with an amending procedure (Triguero et al., 2015). The amending procedures select or weight the self-labeled instances which will enlarge the labeled dataset, to avoid the propagation of misclassification errors.

The SLGB method (Grau et al., 2018) combines the self-labeling strategy of SSC with the global surrogate idea from explainable artificial intelligence in one model. SLGB first trains a black box classifier to predict the decision class, based on the labeled instances available. The black box is exploited in the self-labeling step for assigning labels to the unlabeled instances. Once the enlarged dataset is entirely labeled, a surrogate white box classifier is trained for mimicking the predictions made by the black box. The aim is to obtain better performance than the base white box component, while maintaining a good balance between performance and interpretability. The blueprint of the SLGB classifier is depicted in Fig. 2.
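Conceptually, the training procedure can be summarized in a few steps, sketched below with scikit-learn-style estimators. This is an illustrative outline only (the experiments in this chapter use Weka); the function name, the default components, and the use of sample weights as the amending hook are assumptions made for the example.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier


def fit_slgb(X_labeled, y_labeled, X_unlabeled, black_box=None,
             white_box=None, amending=None):
    """Illustrative SLGB training: the black box self-labels the unlabeled
    data and the white box is trained to mimic it on the enlarged dataset."""
    black_box = black_box or RandomForestClassifier(n_estimators=100)
    white_box = white_box or DecisionTreeClassifier(min_samples_leaf=2)

    # 1. Train the black box on the available labeled instances.
    black_box.fit(X_labeled, y_labeled)

    # 2. Self-labeling step: assume the black-box predictions are correct.
    y_pseudo = black_box.predict(X_unlabeled)

    # 3. Build the enlarged dataset (labeled + self-labeled instances).
    X_all = np.vstack([X_labeled, X_unlabeled])
    y_all = np.concatenate([y_labeled, y_pseudo])

    # 4. Optional amending: down-weight uncertain self-labeled instances.
    weights = np.ones(len(y_all))
    if amending is not None:
        weights[len(y_labeled):] = amending(X_labeled, y_labeled,
                                            X_unlabeled, y_pseudo)

    # 5. Train the interpretable surrogate on the enlarged, weighted dataset.
    white_box.fit(X_all, y_all, sample_weight=weights)
    return black_box, white_box
```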

Fig. 2
Fig. 2 Blueprint of the SLGB architecture using amending procedures for correcting the influence of the misclassifications from the self-labeling process.

To avoid the propagation of misclassification errors during the self-labeling, SLGB uses an amending procedure proposed by Grau et al. (2020a). The amending strategy of SLGB is based on a measure of inconsistency in the classification. This type of uncertainty emerges when very similar instances have different class labels, which can result from errors in the self-labeling process. For measuring this inconsistency across the dataset, we rely on rough set theory (Pawlak, 1982), a mathematical formalism for describing any set of objects in terms of their lower and upper approximations. In this context, an object is an instance of the dataset, described by its attributes. The sets are the decision classes that group these instances. The lower approximation of a given set contains all those instances that are certainly classified in that class, while the upper approximation contains instances that might belong to that class. From the lower and upper approximations of each set, positive, boundary, and negative regions of each decision class are computed. All instances in the positive region of a class are certainly classified as that class. Likewise, all instances in the negative region of a class are certainly not labeled as the given class. However, the boundary region of a decision class is formed by instances that might belong to the class but are not certain. An inclusion degree measure, computed using information from these regions and similar instances (Grau et al., 2020b), is used as an indicator of how certain a prediction from the self-labeling process is. The white box component then focuses on learning from the most confident instances without ignoring the less confident ones coming from the boundary regions. This amending procedure not only improves the accuracy of SLGB, but also increases the interpretability of the surrogate white box by preserving the transparency of the white box component (Grau et al., 2020a).
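As a rough illustration of the amending idea, the sketch below weights each self-labeled instance by how consistently its labeled neighborhood agrees with the assigned label. This neighborhood-agreement proxy is a deliberate simplification of the rough-set inclusion degree described above; the function name, the parameter k, and the weight floor are assumptions. Its signature matches the amending hook of the earlier SLGB sketch.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def consistency_weights(X_labeled, y_labeled, X_pseudo, y_pseudo, k=5):
    """Simplified amending weights: the fraction of nearby labeled instances
    that agree with the self-assigned label. Instances in mixed ('boundary')
    neighborhoods get lower weights but are not discarded."""
    y_labeled = np.asarray(y_labeled)
    y_pseudo = np.asarray(y_pseudo)
    nn = NearestNeighbors(n_neighbors=k).fit(X_labeled)
    _, idx = nn.kneighbors(X_pseudo)
    agreement = (y_labeled[idx] == y_pseudo[:, None]).mean(axis=1)
    # Keep a small floor so uncertain (boundary) instances still contribute.
    return np.clip(agreement, 0.1, 1.0)
```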

The SLGB method is a general framework which is flexible in the choice of black box and white box components. In this work, random forest (Breiman, 2001) will be used as the black box base classifier. Random forest is an ensemble of decision trees, each built from a random subset of attributes, which uses the bagging technique (Breiman, 1996) for aggregating the results of the individual classifiers. The choice of this method as black box is supported by its well-known performance in supervised classification (Fernández-Delgado et al., 2014; Wainberg and Frey, 2016; Zhang et al., 2017) and particularly as a black box component for SLGB (Grau et al., 2020b).

Likewise, we will explore the use of several intrinsically interpretable classifiers that produce explanations in the form of if-then rules. In these rules, the condition is a conjunction of feature evaluations and the conclusion is the prediction of the target value. A first option for the white box is a decision tree learned using the C4.5 algorithm (Quinlan, 1993), which produces a tree-like structure that offers a global view of the model. The most informative attributes are chosen greedily by C4.5 for splitting the dataset at each node of the tree. In this way, the error is minimized when all instances are covered by the leaves of the tree. Decision trees are considered transparent since traversing the tree to infer the classification of an individual instance produces an if-then rule which constitutes an explanation of the obtained classification.

A second option as white box component is a decision list of rules. In this chapter, we explore two mainstream algorithms for generating decision lists using sequential covering: partial decision trees (PART) (Frank and Witten, 1998) and repeated incremental pruning to produce error reduction (RIPPER) (Cohen, 1995). Sequential covering is a common divide-and-conquer strategy for building decision lists. These algorithms induce a rule from data and remove the covered instances before inducing the next rule, until all instances are covered by rules or a default rule is needed. Therefore, the set of rules of a decision list must be interpreted in order. PART is one of the many algorithms implementing this strategy; each rule is derived as the leaf with the largest coverage from a partially built and pruned C4.5 decision tree. RIPPER is another representative algorithm that uses reduced error pruning and an optimization strategy to revise the induced rules, generally producing more compact rule sets. Like decision trees, decision lists are transparent and easily decomposable since the explanations that can be generated are rules using features and values of the problem domain.
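To make the ordered reading of a decision list concrete, the short sketch below applies a list of if-then rules from top to bottom until one fires, falling back to a default rule. The two rules, feature names, and thresholds are purely hypothetical and are not taken from the models reported later in this chapter.

```python
def predict_with_decision_list(instance, rules, default_label):
    """A decision list is read top to bottom: the first rule whose conditions
    all hold determines the prediction; otherwise the default rule fires."""
    for conditions, label in rules:
        if all(condition(instance) for condition in conditions):
            return label
    return default_label


# Hypothetical two-rule list (feature names and thresholds are made up).
example_rules = [
    ([lambda x: x["FASN"] > 1.2, lambda x: x["age"] > 60], "stage iii"),
    ([lambda x: x["EIF4G"] <= -0.5], "stage i"),
]
patient = {"FASN": 1.5, "age": 65, "EIF4G": 0.0}
print(predict_with_decision_list(patient, example_rules, "stage iia"))  # -> stage iii
```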

3: Data preparation

Over 12 years, the cancer genome atlas (TCGA) project has collected and analyzed over 20,000 samples from more than 11,000 patients with different types of cancer (Liu et al., 2018; Weinstein et al., 2013). The data in this repository are publicly available and broadly comprise genomic, epigenomic, transcriptomic, clinical (Liu et al., 2018), and proteomic (Li et al., 2013) data. In this chapter, we focus our experiments on the prediction of the cancer stage based on two data dimensions: clinical and protein expression. We chose three types of cancer for our exploratory experiments: breast (common), esophageal (less common), and thyroid (rare) cancer.

The clinical data used in this study were downloaded from the cancer genome atlas (see footnote a) (Liu et al., 2018). From these data, we used the radiation and drug treatment information. Features describing the radiation treatment include its type (i.e., external or internal), the received dose measured in grays (Gy), the site of radiation treatment (e.g., primary tumor field, regional site, distant recurrence, local recurrence, or distant site), and the patient's response to the treatment. Features of drug treatment include the type of drug therapy (e.g., hormone therapy, chemotherapy, targeted molecular therapy, ancillary therapy, immunotherapy, vaccine, or others), the total dose, the route of administration, and the measured response to the treatment. When a patient had more than one treatment record, all records were considered. In addition, we include the age of the patient at the first pathologic stage diagnosis.

The protein expression data for the three cancer types were downloaded from the cancer proteome atlas (see footnote b) (Li et al., 2013). We used the level 4 (L4) reverse phase protein array (RPPA) data for the analysis, as batch effects have been removed in L4 (Li et al., 2013). Each of these datasets has the protein expression values estimated by the RPPA high-throughput antibody-based technique for the key proteins involved in the regulation of that cancer type. It also includes the phosphoproteins, i.e., the proteins which are phosphorylated in posttranslational processes. For example, AKT_pS473 is a phosphorylated form of AKT (serine-threonine protein kinase), phosphorylated at amino acid position 473. In cancer regulation and many other diseases, posttranslational modifications like phosphorylation, degradation, and glycosylation play a key role; for example, the role of tyrosine phosphorylation is well established in cancer biology (Lim, 2005). Therefore, all the proteins, including the phosphoproteins, were considered for the experiments. Data for all the phosphoproteins were normalized by subtracting the expression value of the respective parent protein from that of the phosphoprotein. For example, AKT is the parent protein of AKT_pS473; to obtain the relevant phosphorylation score, we compute the difference between AKT_pS473 and AKT. Phosphoproteins whose parent protein expression values were not in the dataset were not considered in the experiments, as they cannot be normalized.
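A minimal sketch of this normalization, assuming a pandas data frame with TCPA-style column names (the toy values and the "_p" naming convention used to detect phosphoproteins are assumptions):

```python
import pandas as pd

# Toy RPPA table; the real TCPA column naming (e.g., "AKT_pS473") is assumed.
rppa = pd.DataFrame({
    "AKT": [0.8, -0.2],
    "AKT_pS473": [1.1, 0.3],
    "ER_pS118": [0.5, 0.4],   # parent protein not measured -> dropped
})

normalized = {}
for column in rppa.columns:
    if "_p" not in column:
        normalized[column] = rppa[column]            # parent proteins kept as-is
        continue
    parent = column.split("_p")[0]
    if parent in rppa.columns:
        # Phosphorylation score = phosphoprotein minus parent protein expression.
        normalized[column] = rppa[column] - rppa[parent]
    # Phosphoproteins without a measured parent protein are not considered.

rppa_normalized = pd.DataFrame(normalized)
print(rppa_normalized)
```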

The clinical features are stored at the patient level, whereas the protein expression data are stored at the sample level. Therefore, the patient identifier was used to match each sample to the corresponding patient.

The pathological stage of the patient constitutes the target feature to be predicted. After preprocessing and cleaning, a total of 3073 samples from 1789 patients are included in our experiments. The distribution of patients per type of cancer can be seen in Fig. 3, as well as the distribution of cancer stages across all types of cancer in Fig. 4. The latter figure reveals a class imbalance in the dataset, with a majority of patients labeled as stage IIA. Because the information was gathered from different data sources, not all patients have values available for all the features, and therefore the datasets contain missing values for some of them. Missing values are also present in the target feature cancer stage, leading to unlabeled instances that will be leveraged for semisupervised classification. For those patients with more than one recorded stage of the same type of cancer, we kept the most advanced one.
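A small pandas sketch of this preparation step, with hypothetical identifiers, column names, and stage labels, illustrating the patient-level join and how the most advanced recorded stage is kept per patient (patients without any recorded stage become unlabeled instances):

```python
import pandas as pd

# Hypothetical identifiers and column names standing in for the TCGA tables.
clinical = pd.DataFrame({
    "patient_id": ["P1", "P1", "P2"],
    "pathologic_stage": ["stage iia", "stage iiia", None],
    "age_at_diagnosis": [61, 61, 55],
})
samples = pd.DataFrame({
    "sample_id": ["P1-01", "P1-02", "P2-01"],
    "patient_id": ["P1", "P1", "P2"],
    "FASN": [1.2, 0.9, -0.3],
})

# Order the stages so that "most advanced" is well defined, then keep one per patient.
stage_order = ["stage i", "stage iia", "stage iib", "stage iiia", "stage iv"]
clinical["pathologic_stage"] = pd.Categorical(
    clinical["pathologic_stage"], categories=stage_order, ordered=True)
most_advanced = (clinical.sort_values("pathologic_stage")
                         .groupby("patient_id", as_index=False).last())

# Sample-level protein data joined to patient-level clinical features;
# patients without a recorded stage (P2 here) become unlabeled instances.
dataset = samples.merge(most_advanced, on="patient_id", how="left")
print(dataset[["sample_id", "pathologic_stage", "age_at_diagnosis", "FASN"]])
```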

Fig. 3
Fig. 3 Distribution of cancer types across patients in the dataset, showing breast cancer with the maximum number of instances. BRCA, breast cancer; ESCA, esophageal cancer; THCA, thyroid cancer.
Fig. 4
Fig. 4 Distribution of cancer stages across patients in the dataset shows high imbalance and stage IIA as the most common stage.

4: Experiments and discussion

In this section, we explore the cancer stage prediction problem for breast, esophageal, and thyroid cancer through different settings. We first explore the baseline predictions obtained by the black box and white box component classifiers when working on the labeled data only. We show the influence of adding the proteomic dimension to the clinical data for the stage prediction. Second, we explore how the unlabeled data help in the semisupervised setting where all cancer types are used to build the model. Finally, we explore how unlabeled data coming from more common cancer types can help in predicting the cancer stage of rarer ones.

For the validation of our experiments, a leave-one-group-out cross-validation was used. In this type of cross-validation, the dataset was divided into 10 disjoint groups of patients to avoid using samples of the same patient for both training and testing. Patients with a missing cancer stage are not included in the test sets; they are only added to the training sets when semisupervised classification is performed. Notice that one patient can have more than one sample record and each sample constitutes an instance in the dataset (see Tables 1–4 for the number of instances).
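A toy sketch of this protocol using scikit-learn's GroupKFold (the chapter's setup uses 10 patient groups; the synthetic data and 3 splits below are only for illustration):

```python
import numpy as np
from sklearn.model_selection import GroupKFold

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5))                 # one row per sample
y = rng.integers(0, 3, size=30)              # stage label per sample
patient_ids = rng.integers(0, 12, size=30)   # several samples may share a patient

# Grouped cross-validation: all samples of a patient stay on one side of the split.
gkf = GroupKFold(n_splits=3)
for fold, (train_idx, test_idx) in enumerate(gkf.split(X, y, groups=patient_ids)):
    shared = set(patient_ids[train_idx]) & set(patient_ids[test_idx])
    print(f"fold {fold}: {len(shared)} patients on both sides")  # always 0
```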

Table 1

Classification performance change with incrementing features for random forests (RF) when using clinical and proteomic data, for each cancer type and the entire dataset.
         Instances  Features      ACC   KAPPA  SEN   SPE   AUC
BR       1898       Clinical      0.44  0.24   0.45  0.79  0.69
                    + proteomic   0.81  0.75   0.81  0.93  0.94
ES       124        Clinical      0.25  0.01   0.25  0.73  0.51
                    + proteomic   0.41  0.17   0.41  0.75  0.61
TH       383        Clinical      0.60  0.36   0.60  0.84  0.77
                    + proteomic   0.60  0.24   0.60  0.61  0.76
All      2405       Clinical      0.43  0.30   0.43  0.87  0.74
                    + proteomic   0.75  0.69   0.75  0.92  0.94

Table 2

Classification performance change with incrementing features for decision tree (C4.5) when using clinical and proteomic data, for each cancer type and the entire dataset.
         Instances  Features      ACC   KAPPA  SEN   SPE   AUC   Rules
BR       1898       Clinical      0.46  0.25   0.46  0.79  0.70  94
                    + proteomic   0.75  0.68   0.75  0.94  0.86  222
ES       124        Clinical      0.30  -0.01  0.30  0.69  0.47  4
                    + proteomic   0.23  0.06   0.23  0.84  0.46  28
TH       383        Clinical      0.65  0.45   0.65  0.90  0.77  4
                    + proteomic   0.59  0.37   0.59  0.96  0.64  40
All      2405       Clinical      0.51  0.39   0.51  0.88  0.78  299
                    + proteomic   0.67  0.61   0.67  0.93  0.82  325

Table 3

Classification performance change with incrementing features for decision lists (PART) when using clinical and proteomic data, for each cancer type and the entire dataset.
         Instances  Features      ACC   KAPPA  SEN   SPE   AUC   Rules
BR       1898       Clinical      0.39  0.14   0.39  0.75  0.61  78
                    + proteomic   0.77  0.70   0.77  0.94  0.86  149
ES       124        Clinical      0.29  -0.04  0.29  0.68  0.48  10
                    + proteomic   0.28  0.11   0.28  0.84  0.45  17
TH       383        Clinical      0.65  0.46   0.65  0.90  0.78  12
                    + proteomic   0.62  0.41   0.62  0.85  0.67  28
All      2405       Clinical      0.46  0.34   0.46  0.87  0.73  277
                    + proteomic   0.68  0.61   0.68  0.93  0.80  229

Table 4

Classification performance change with incrementing features for decision lists (RIPPER) when using clinical and proteomic data, for each cancer type and the entire dataset.
         Instances  Features      ACC   KAPPA  SEN   SPE   AUC   Rules
BR       1898       Clinical      0.43  0.22   0.43  0.77  0.62  25
                    + proteomic   0.72  0.63   0.72  0.92  0.84  60
ES       124        Clinical      0.33  0.00   0.33  0.66  0.43  2
                    + proteomic   0.33  0.05   0.33  0.72  0.45  5
TH       383        Clinical      0.65  0.37   0.65  0.72  0.64  3
                    + proteomic   0.63  0.36   0.63  0.75  0.63  6
All      2405       Clinical      0.41  0.25   0.41  0.82  0.66  32
                    + proteomic   0.64  0.56   0.64  0.91  0.83  75

The Weka library (Hall et al., 2009) was used for the implementation of the random forest, decision tree, and decision list algorithms (see footnote c). Random forests consist of 100 decision trees, each built using a random subset of features with cardinality equal to the base-2 logarithm of the number of features. Decision trees and PART decision lists use the C4.5 algorithm for generating the trees, with a parameter C = 0.25 which denotes the confidence value for pruning the trees (the lower the value, the more pruning is performed). The minimum number of instances in each leaf is two. In the RIPPER implementation, the training data are split into a growing set and a pruning set for performing reduced error pruning. The rule set formed from the growing set is simplified with pruning operations optimizing the error on the pruning set. The minimum allowed support of a rule is two, and the data are split into three folds of which one is used for pruning. Additionally, the number of optimization iterations is set to two.
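For readers working in Python, the snippet below gives rough scikit-learn analogues of part of this configuration; it is an approximation only, since C4.5, PART, and RIPPER and their pruning options have no direct scikit-learn equivalent.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.tree import DecisionTreeClassifier

# Rough scikit-learn analogues of the Weka setup described above; only the
# ensemble size, feature subsampling, and minimum leaf size are mirrored here.
random_forest = RandomForestClassifier(n_estimators=100, max_features="log2")
decision_tree = DecisionTreeClassifier(min_samples_leaf=2)
```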

Given the nature of the target attribute, the prediction problem at hand is not only an imbalanced multiclass classification problem, but also an ordinal one. The traditional approach to deal with ordinal classification is coding the decision class into numeric values and using a regression model for the prediction. However, this limits the choice of black box and white box components to regression techniques only. Instead, we use the approach described by Frank and Hall (2001), which does not require any modification of the underlying prediction algorithm. With this technique, our multiclass classification problem is transformed into several binary classification datasets, where each predictor estimates the probability of the class value being greater than a given label and the class with the greatest derived probability is taken as the decision. This transformation is only applied in the black box component of the SLGB, without affecting the interpretability of the white box component.
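A compact sketch of this ordinal decomposition, assuming scikit-learn estimators with predict_proba (the class and method names are illustrative; the sketch also assumes every binary split contains both classes):

```python
import numpy as np
from sklearn.base import clone
from sklearn.ensemble import RandomForestClassifier


class OrdinalClassifier:
    """Frank & Hall (2001) style decomposition: K-1 binary models estimate
    P(class > c_i); class probabilities are recovered from their differences."""

    def __init__(self, base=None):
        self.base = base or RandomForestClassifier(n_estimators=100)

    def fit(self, X, y, ordered_classes):
        self.classes_ = list(ordered_classes)    # e.g., ["stage i", ..., "stage iv"]
        ranks = np.array([self.classes_.index(v) for v in y])
        # One binary dataset per threshold: is the class greater than c_i?
        self.models_ = [clone(self.base).fit(X, (ranks > i).astype(int))
                        for i in range(len(self.classes_) - 1)]
        return self

    def predict(self, X):
        greater = np.column_stack([m.predict_proba(X)[:, 1] for m in self.models_])
        probs = np.hstack([1 - greater[:, :1],                 # P(y = c_0)
                           greater[:, :-1] - greater[:, 1:],   # middle classes
                           greater[:, -1:]])                   # P(y = c_{K-1})
        return [self.classes_[i] for i in probs.argmax(axis=1)]
```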

4.1: Influence of clinical and proteomic data on the prediction of cancer stage

In this subsection, we explore the baseline performance of the classifiers that will later be used as components of the SLGB method. This evaluation is performed in a supervised setting, i.e., only the labeled information is considered. First, we evaluate the performance of random forests (RF), decision trees (C4.5), and the decision list algorithms PART and RIPPER on the classification of cancer stages based on the clinical data only. Later, we add the protein features to compare how much the proteomic data contributes in terms of performance. We perform this analysis for each cancer type: breast (BR), esophagus (ES), and thyroid (TH), and additionally for the entire dataset. Tables 1–4 show the results using different performance metrics. Accuracy (ACC) shows the proportion of correctly classified instances, while kappa (KAPPA) (Cohen, 1960) considers the agreement occurring by chance, which makes it more robust in the presence of class imbalance. Other measures such as sensitivity (SEN), specificity (SPE), and area under the receiver operating characteristic curve (AUC) are also included. Since the prediction problem at hand is a multiclass classification problem, the last three measures are weighted averages of the per-class values. For the white box classifiers, the number of rules is reported as an indication of the size of the structure and its simplicity. The number of rules is measured for the model built on all instances instead of on individual cross-validation folds.
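The sketch below shows how most of these metrics can be computed with scikit-learn on a toy three-class example; per-class specificity has no dedicated scikit-learn function and would instead be derived from the confusion matrix.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, cohen_kappa_score,
                             recall_score, roc_auc_score)

# Toy 3-class example standing in for the stage labels.
y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 1, 1, 1, 2, 0])
y_prob = np.array([[0.7, 0.2, 0.1], [0.3, 0.5, 0.2], [0.2, 0.6, 0.2],
                   [0.1, 0.8, 0.1], [0.1, 0.2, 0.7], [0.5, 0.3, 0.2]])

acc = accuracy_score(y_true, y_pred)
kappa = cohen_kappa_score(y_true, y_pred)                # agreement corrected for chance
sens = recall_score(y_true, y_pred, average="weighted")  # class-weighted sensitivity
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="weighted")
print(acc, kappa, sens, auc)
```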

From Tables 1–4, we can conclude that adding proteomic information to the clinical data substantially improves the accuracy of all classifiers on the datasets, most evidently for breast cancer. Looking at the performance across classifiers, random forests achieve the best results in terms of accuracy. Their high kappa values indicate that, despite the class imbalance, the random forest can generalize beyond predicting the majority classes. This is supported by high true positive and true negative rates. Overall, these results make random forests a promising base black box component for self-labeling the unlabeled data in the following experiments.

Regarding the performance of the white box base classifiers, lower accuracy compared to RF is observed across datasets, which is an expected result. Nevertheless, the accuracy values obtained on the entire dataset by the three white box methods are greater than 0.64 and are supported by high kappa, sensitivity, specificity, and AUC values. Regarding the number of rules, C4.5, being the most accurate, comes with the largest number of rules, followed by PART. RIPPER obtains slightly less accurate results with the largest reduction in the number of rules and is therefore the most transparent classifier. However, the interpretation of these three white boxes differs and can be exploited according to the needs of the user.

Comparing the results across the different types of cancer, there is evidence that the limited data for esophageal and thyroid cancer lead to poor performance. This contrasts with the performance of the predictors trained on breast cancer data, which are more abundant and better balanced across classes. In the next section, we join the data for all types of cancer and explore whether the SLGB can obtain a trade-off between performance and interpretability in the semisupervised setting.

4.2: Influence of unlabeled data on the prediction of cancer stage

In this section, we explore the performance of SLGB in the semisupervised prediction of the cancer stage. This time we incorporate 668 unlabeled instances into the learning process, in addition to the 2405 labeled ones. As stated earlier, RF is used as the base classifier for the black box component. A weighting process based on rough set theory measures is used for amending the errors in the self-labeling process. The three white boxes presented earlier are explored, comparing their performance and interpretability. Table 5 summarizes the experimental results.

Table 5

Classification performance and interpretability of the SLGB classifier using different white box classifiers.
                   ACC   KAPPA  SEN   SPE   AUC   Rules
SLGB (RF-C4.5)     0.70  0.62   0.70  0.92  0.85  235
SLGB (RF-PART)     0.69  0.60   0.69  0.91  0.82  174
SLGB (RF-RIPPER)   0.63  0.52   0.63  0.88  0.83  26
C4.5               0.67  0.61   0.67  0.93  0.82  325
PART               0.68  0.61   0.68  0.93  0.80  229
RIPPER             0.64  0.56   0.64  0.91  0.83  75

The performance results of white boxes from the previous section are summarized for comparison purposes.

Although the number of added unlabeled instances is not large, the SLGB still manages to improve or maintain the performance compared to its base white boxes, while reducing the number of rules needed for achieving this accuracy. The best results are observed with C4.5 as white box, where the accuracy increases by 0.03 while the number of rules is reduced to 72% of the supervised baseline (from 325 to 235 rules), effectively gaining in transparency.

When examining the decision tree generated by the gray box model (see the pruned first levels of the tree in Fig. 5), the most informative attributes detected are the proteins FASN, EIF4G, TIGAR, and ADAR1, and the clinical feature "age at initial pathologic diagnosis." High levels of expression of the FASN (fatty acid synthase) protein have been associated in several studies with the later stages of cancer, predicting poor prognosis for breast cancer among others (Buckley et al., 2017). Overexpression of EIF4G is associated with malignant transformation (Bauer et al., 2002; Fukuchi-Shimogori et al., 1997). TIGAR expression regulates the p53 tumor suppressor protein, which prevents cancer development through various mechanisms (Bensaad et al., 2006; Green and Chipuk, 2006; Won et al., 2012). ADAR1 has a demonstrated functional role in RNA editing in thyroid cancer (Ramírez-Moya and Santisteban, 2020; Xu et al., 2018). The PART and RIPPER rules (see Figs. 6 and 7) associate these and other features with the stages of cancer. While PART exhibits its most confident and best-supported rules first, RIPPER focuses on predicting the minority class. Therefore, the choice of which decision list to use depends on whether explanations are needed about the most common patterns or the rarest ones. These known associations support the rules learned by the machine learning models, which provide potential relations that need to be further analyzed and validated clinically.

Fig. 5
Fig. 5 First levels of C4.5 decision tree obtained by SLGB for classifying the stage of cancer, using data from the three types of cancer considered in this study.
Fig. 6
Fig. 6 Subset of rules obtained by SLGB using PART algorithm for classifying the stage of cancer, using data from the three types of cancer considered in this study.
Fig. 7
Fig. 7 Subset of rules obtained by SLGB using RIPPER algorithm for classifying the stage of cancer, using data from the three types of cancer considered in this study.

While the SLGB approach is already able to leverage the unlabeled data for improving performance and interpretability, more impressive results are commonly obtained when the number of unlabeled instances is greater than the number of labeled ones. In the next subsection, we study how unlabeled instances coming from more frequent types of cancer help in the classification of rarer ones.

4.3: Influence of unlabeled data on the prediction of cancer stage for rare cancer types

In this subsection, we study how unlabeled instances coming from a more frequent type of cancer, such as breast cancer, help in the classification of rarer ones. For this setting, we assume that all instances from breast cancer have the cancer stage label missing. In this manner, we study whether unlabeled data from breast cancer help to improve the generalization of the classifier for thyroid and esophageal cancer. Fig. 8 shows the distribution of unlabeled instances per type of cancer in the dataset as used in the previous section (Fig. 8A) and after discarding the labels of breast cancer for the current experiment (Fig. 8B).

Fig. 8
Fig. 8 Distribution of (A) original unlabeled samples and (B) unlabeled samples (once all labels from breast cancer are discarded) across types of cancer in the dataset. BRCA, breast cancer; ESCA, esophageal cancer; THCA, thyroid cancer.

Next, we test how much the performance of SLGB improves on the classification of rare cancers with regard to its interpretable supervised baseline. Table 6 shows the results of the experiment using several measures of performance and the number of rules generated as an indication of the complexity of the model. Overall, this being a very imbalanced multiclass classification problem, it is challenging to obtain high accuracy even for the RF classifier. Nevertheless, the accuracy obtained by all classifiers is well balanced across classes, as evidenced by a fair kappa value and high specificity.

Table 6

Classification performance and interpretability of the SLGB classifier on the cancer stage classification of thyroid and esophagus cancers, using unlabeled data from breast cancer.
                   ACC   KAPPA  SEN   SPE   AUC   Rules
SLGB (RF-C4.5)     0.57  0.43   0.57  0.89  0.73  80
SLGB (RF-PART)     0.57  0.43   0.57  0.90  0.73  47
SLGB (RF-RIPPER)   0.56  0.41   0.56  0.89  0.72  13
RF                 0.61  0.48   0.61  0.90  0.87  -
C4.5               0.52  0.37   0.52  0.90  0.66  74
PART               0.48  0.31   0.48  0.88  0.67  51
RIPPER             0.52  0.32   0.52  0.83  0.65  12

The performance results of base classifiers are shown as a baseline. The best results are highlighted in bold.

From the table, we can observe that SLGB clearly outperforms its white box base classifiers in each case, with the biggest improvement obtained using PART decision lists. At the same time, the number of rules is kept reasonably similar, without adding further complexity to the classifier and therefore keeping the transparency to some extent. The best results were obtained by SLGB using PART, followed by SLGB using RIPPER, although RIPPER needs a smaller number of rules to achieve its performance. However, the interpretation of these two classifiers differs in focus, with PART being more appropriate for finding rules describing frequent patterns and RIPPER for rarer ones, as it starts from the minority class label.

5: Conclusions

In this chapter, we illustrate the application of the interpretable semisupervised classifier SLGB in the prediction of the stage of cancer patients. In a first experiment, the performance of the base classifiers composing the self-labeling gray box indicated that joining the clinical and proteomic data from cancer patients improves the generalization ability. Later, we empirically demonstrate that the self-labeling gray box is accurate in predicting the stage of cancer by leveraging the unlabeled data already present in the dataset. We extend this experiment by simulating that all data coming from breast cancer are unlabeled, to study how much the SLGB is able to improve its prediction on less frequent cancers such as thyroid and esophageal cancer. In this setting, the SLGB outperformed its white box baseline classifiers while keeping the transparency (in terms of number of rules) very similar. Using random forests as a black box component and three different alternatives as white boxes, involving decision trees and rule lists, allows obtaining interpretable classifiers for different scenarios. We show the form of representation of the patterns extracted by the three different white box techniques, which detect several protein expression features that are known to play an important role in the progression of cancer. These known associations support the rules learned by the SLGB, providing potential relations that could be further analyzed clinically. In this regard, future research will further explore the validation of the patterns detected by the interpretable models by contrasting the discovered knowledge with experts' criteria and complementing it with other traditional analysis techniques. The current results pave the way for using SLGB as a tool for aiding clinicians in detecting important proteomic and clinical features that contribute to the development of advanced stages in cancer.

References

Altman N.S.An introduction to kernel and nearest-neighbor nonparametric regression. Am. Stat. 1992;46:175–185. doi:10.1080/00031305.1992.10475879.

Barredo Arrieta A., Díaz-Rodríguez N., Del Ser J., Bennetot A., Tabik S., Barbado A., Garcia S., Gil-Lopez S., Molina D., Benjamins R., Chatila R., Herrera F. Explainable artificial intelligence (XAI): concepts, taxonomies, opportunities and challenges toward responsible AI. Inf. Fusion. 2020;58:82–115. doi:10.1016/j.inffus.2019.12.012.

Bauer C., Brass N., Diesinger I., Kayser K., Grässer F.A., Meese E.Overexpression of the eukaryotic translation initiation factor 4G (eIF4G-1) in squamous cell lung carcinoma. Int. J. Cancer. 2002;98:181–185. doi:10.1002/ijc.10180.

Bensaad K., Tsuruta A., Selak M.A., Vidal M.N.C., Nakano K., Bartrons R., Gottlieb E., Vousden K.H.TIGAR, a p53-inducible regulator of glycolysis and apoptosis. Cell. 2006;126:107–120. doi:10.1016/j.cell.2006.05.036.

Blum A., Chawla S.Learning From Labeled and Unlabeled Data Using Graph Mincuts. 2001.doi:10.1184/R1/6606860.V1.

Bray F., Ferlay J., Soerjomataram I., Siegel R.L., Torre L.A., Jemal A.Global cancer statistics 2018: GLOBOCAN estimates of incidence and mortality worldwide for 36 cancers in 185 countries. CA Cancer J. Clin. 2018;doi:10.3322/caac.21492.

Breiman L.Bagging predictors. Mach. Learn. 1996;24:123–140. doi:10.1007/BF00058655.

Breiman L. Random forests. Mach. Learn. 2001;45:5–32. doi:10.1023/A:1010933404324.

Buckley D., Duke G., Heuer T.S., O’Farrell M., Wagman A.S., McCulloch W., Kemble G.Fatty acid synthase – modern tumor cell biology insights into a classical oncology target. Pharmacol. Ther. 2017;doi:10.1016/j.pharmthera.2017.02.021.

Cohen J.A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 1960;20:37–46. doi:10.1177/001316446002000104.

Cohen W.W. Fast effective rule induction. In: Prieditis A., Russell S., eds. Machine Learning Proceedings 1995; San Francisco, CA: Elsevier; 1995:115–123. doi:10.1016/b978-1-55860-377-6.50023-2.

Cosma G., Acampora G., Brown D., Rees R.C., Khan M., Pockley A.G.Prediction of pathological stage in patients with prostate cancer: a neuro-fuzzy model. PLoS One. 2016;11:doi:10.1371/journal.pone.0155856.

Cruz J.A., Wishart D.S.Applications of machine learning in cancer prediction and prognosis. Cancer Inform. 2006;2:doi:10.1177/117693510600200030 117693510600200.

Doshi-Velez F., Kim B.Towards a Rigorous Science of Interpretable Machine Learning. arXiv:1702.08608; 2017.1–13.

Fernández-Delgado M., Cernadas E., Barro S.S., Amorim D., Fernandez-Delgado M., Cernadas E., Barro S.S., Amorim D.Do we need hundreds of classifiers to solve real world classification problems. J. Mach. Learn. Res. 2014;15:3133–3181.

Frank E., Hall M.A simple approach to ordinal classification. In: Lecture Notes in Computer Science (Including Subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). Springer Verlag; 2001:145–156. doi:10.1007/3-540-44795-4_13.

Frank E., Witten I.H. Generating accurate rule sets without global optimization. In: Proceedings of the Fifteenth International Conference on Machine Learning, ICML ‘98; San Francisco, CA, USA: University of Waikato, Department of Computer Science; 1998:1-55860-556-8144–151.

Friedman J.H.Greedy function approximation: a gradient boosting machine. Ann. Stat. 2001;1:1189–1232. doi:10.2307/2699986.

Fukuchi-Shimogori T., Ishii I., Kashiwagi K., Mashiba H., Ekimoto H., Igarashi K.Malignant transformation by overproduction of translation initiation factor eIF4G. Cancer Res. 1997;57.

Goldberg X., Zhu X., Goldberg A.Introduction to semi-supervised learning. In: Synthesis Lectures on Artificial Intelligence and Machine Learning. Morgan & Claypool Publishers; 2009:doi:10.2200/S00196ED1V01Y200906AIM006.

Grau I., Sengupta D., Garcia Lorenzo M.M., Nowé A. Interpretable self-labeling semi-supervised classifier. In: IJCAI/ECAI 2018 Workshop on Explainable Artificial Intelligence (XAI); 2018.

Grau I., Sengupta D., Garcia Lorenzo M.M., Nowé A. An interpretable semi-supervised classifier using rough sets for amended self-labeling. In: IEEE International Conference on Fuzzy Systems (FUZZ-IEEE); IEEE; 2020a.

Grau I., Sengupta D., Garcia Lorenzo M.M., Nowé A. An Interpretable Semi-Supervised Classifier Using Two Different Strategies for Amended Self-Labeling. 2020b arXiv Prepr. arXiv2001.09502.

Green D.R., Chipuk J.E.p53 and metabolism: inside the TIGAR. Cell. 2006;doi:10.1016/j.cell.2006.06.032.

Gress D.M., Edge S.B., Greene F.L., Washington M.K., Asare E.A., Brierley J.D., Byrd D.R., Compton C.C., Jessup J.M., Winchester D.P., Amin M.B., Gershenwald J.E.Principles of Cancer Staging. 2017.doi:10.1007/978-3-319-40618-3_1.

Hall M., Frank E., Holmes G., Pfahringer B., Reutemann P., Witten I.H.The WEKA data mining software: an update. ACM SIGKDD Explor. Newsl. 2009;11:10–18.

Hastie T., Tibshirani R., Friedman J.The Elements of Statistical Learning. 2008.doi:10.1007/978-0-387-84858-7.

Hausman D.M. What is cancer?. Perspect. Biol. Med. 2019;62:778–784. doi:10.1353/pbm.2019.0046.

Joachims T.Transductive inference for text classification using support vector machines. In: Proceedings of the 16th International Conference on Machine Learning (ICML); 1999.

Kourou K., Exarchos T.P., Exarchos K.P., Karamouzis M.V., Fotiadis D.I.Machine learning applications in cancer prognosis and prediction. Comput. Struct. Biotechnol. J. 2015;doi:10.1016/j.csbj.2014.11.005.

Li J., Lu Y., Akbani R., Ju Z., Roebuck P.L., Liu W., Yang J.-Y., Broom B.M., Verhaak R.G.W., Kane D.W., Wakefield C., Weinstein J.N., Mills G.B., Liang H. TCPA: a resource for cancer functional proteomics data. Nat. Methods. 2013;doi:10.1038/nmeth.2650.

Lim Y.P. Mining the tumor phosphoproteome for cancer markers. Clin. Cancer Res. 2005;doi:10.1158/1078-0432.CCR-04-2243.

Lipton Z.C. The mythos of model interpretability. In: 2016 ICML Workshop on Human Interpretability in Machine Learning (WHI 2016); Association for Computing Machinery; 2016:96–100.

Liu J., Lichtenberg T., Hoadley K.A., et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;doi:10.1016/j.cell.2018.02.052.

Lundberg S.M., Lee S.-I.A unified approach to interpreting model predictions. In: Guyon I., Luxburg U.V., Bengio S., Wallach H., Fergus R., Vishwanathan S., Garnett R., eds. Advances in Neural Information Processing Systems 30. Curran Associates, Inc.; 2017:4765–4774.

Nelles O.Nonlinear System Identification. Berlin, Heidelberg: Springer; 2001.doi:10.1007/978-3-662-04323-3.

Pawlak Z.Rough sets. Int. J. Comput. Inf. Sci. 1982;11:341–356. doi:10.1007/BF01001956.

Quinlan J.R. C4.5: Programs for Machine Learning. San Mateo, CA: Morgan Kaufmann Publishers; 1993.

Ramírez-Moya J., Santisteban P.Commentary: The oncogenic role of ADAR1-mediated RNA editing in thyroid cancer. J. Cancer Biol. 2020;1(1):16–19.

Ribeiro M.T., Singh S., Guestrin C.Model-Agnostic Interpretability of Machine Learning. 2016.

Said A.A., Abd-Elmegid L.A., Kholeif S., Gaber A.A.Stage-specific predictive models for main prognosis measures of breast cancer. Future Comput. Inform. J. 2018;3:391–397. doi:10.1016/j.fcij.2018.11.002.

Shapley L.S. A value for n-person games. In: Kuhn H.W., ed. Contributions to the Theory of Games, vol. 2. Princeton, NJ: Princeton University Press; 1953:307–317.

Taniguchi K., Ota M., Yamada T., Serizawa A., Noguchi T., Amano K., Kotake S., Ito S., Ikari N., Omori A., Yamamoto M.Staging of gastric cancer with the Clinical Stage Prediction score. World J. Surg. Oncol. 2019;17:47. doi:10.1186/s12957-019-1589-5.

Triguero I., García S., Herrera F. Self-labeled techniques for semi-supervised learning: taxonomy, software and empirical study. Knowl. Inf. Syst. 2015;42:245–284. doi:10.1007/s10115-013-0706-y.

Van Engelen J.E., Hoos H.H.A survey on semi-supervised learning. Mach. Learn. 2020;109:373–440. doi:10.1007/s10994-019-05855-6.

Wainberg M., Frey B.J.Are random forests truly the best classifiers?. J. Mach. Learn. Res. 2016;17(110):1–5.

Weinstein J.N., Collisson E.A., Mills G.B., et al. The cancer genome atlas pan-cancer analysis project. Nat. Genet. 2013;doi:10.1038/ng.2764.

Macmillan Cancer Support.Rare cancers. Cancer information and support [WWW Document]. 2020. URL https://www.macmillan.org.uk/cancer-information-and-support/rare-cancers (accessed 3.12.2020).

Cancer Research UK.What is a rare cancer? Rare cancers [WWW document]. 2020. URL https://www.cancerresearchuk.org/about-cancer/rare-cancers/what-rare-cancers-are (accessed 3.12.2020).

WHO.Cancer. Fact Sheets [WWW Document]. 2020. URL https://www.who.int/news-room/fact-sheets/detail/cancer (accessed 3.12.2020).

Won K.Y., Lim S.J., Kim G.Y., Kim Y.W., Han S.A., Song J.Y., Lee D.K.Regulatory role of p53 in cancer metabolism via SCO2 and TIGAR in human breast cancer. Hum. Pathol. 2012;43:221–228. doi:10.1016/j.humpath.2011.04.021.

Xu X., Wang Y., Liang H.The role of A-to-I RNA editing in cancer development. Curr. Opin. Genet. Dev. 2018;doi:10.1016/j.gde.2017.10.009.

Zhang P.W., Chen L., Huang T., Zhang N., Kong X.Y., Cai Y.D.Classifying ten types of major cancers based on reverse phase protein array profiles. PLoS One. 2015;10:doi:10.1371/journal.pone.0123147.

Zhang C., Liu C., Zhang X., Almpanidis G.An up-to-date comparison of state-of-the-art classification algorithms. Expert Syst. Appl. 2017;82:128–150.

Zhu W., Xie L., Han J., Guo X. The application of deep learning in cancer prognosis prediction. Cancers (Basel). 2020;doi:10.3390/cancers12030603.


a https://portal.gdc.cancer.gov/.

b https://www.tcpaportal.org/.

c For the execution of all experiments described in this section, we used a PC with Intel(R) Core(TM) i5-8350U CPU @ 1.70 GHz, 1896 MHz, 4 Cores, 8 Logical Processors, 32.0 GB RAM, and Java Virtual Machine version 8.
