Chapter 16: Prediction of leukemia by classification and clustering techniques

Kartik Rawal (a); Advika Parthvi (a); Dilip Kumar Choubey (b); Vaibhav Shukla (c)
(a) School of Computer Science and Engineering, Vellore Institute of Technology, Vellore, Tamil Nadu, India
(b) Department of Computer Science & Engineering, Indian Institute of Information Technology, Bhagalpur, India
(c) Tech Mahindra Ltd., Mumbai, Maharashtra, India

Abstract

Leukemia is a kind of blood cancer that impacts the white blood cells and damages the bone marrow. Typically the complete blood count (CBC) and bone marrow are affected. It can be a fatal disease if not identified at the earliest stage. Usually, manual microscopic assessment of stained sample slides is used for analysis of leukemia, but manual diagnostic strategies are time consuming, less accurate, and prone to errors due to diverse human elements such as pressure, fatigue, and so on. To avoid possible faults and errors and to assist pathologists, clustering and classification techniques are required, which are being used in every medical field to obtain better outcomes. This chapter emphasizes clustering and classification techniques applied to detection of leukemia. The proposed work consists of two phases: Phase I deals with the collection of the dataset and visualization of datasets, and Phase II deals with machine learning and data mining techniques for the prediction of leukemia. We would expect that the proposed techniques would show better performance than other existing techniques. The proposed techniques could be utilized for other diseases as well.

Keywords

Leukemia; Machine learning; Data mining; Clustering; Classification; Fuzzy c-means; KNIME; KNN

1: Introduction

Leukemia is a cancerous growth of abnormal white cells that destroys the blood and bone marrow. Leukemia is classified by the kind of white blood cells influenced and by how rapidly the illness advances. Lymphocytic leukemia (otherwise called lymphoid or lymphoblastic leukemia) is created in the white blood cells called lymphocytes in the bone marrow. Myeloid (otherwise called myelogenous) leukemia may likewise begin in white blood cells other than lymphocytes, or in red blood cells and platelets.

Based on how rapidly it advances, leukemia is called either acute (quickly developing) or chronic (slowly developing). Acute leukemia advances rapidly, causing the accumulation of immature, functionless cells in the bone marrow. With this sort of leukemia, cells reproduce and build up in the marrow, diminishing the marrow’s capacity to produce enough normal blood cells. Chronic leukemia advances more gradually and results in the accumulation of relatively mature, yet still abnormal, white blood cells.

In diagnostic and prediction software for leukemia, different algorithms, such as support vector machine (SVM), k-nearest neighbor (k-NN), k-means, and fuzzy c-means, are implemented on datasets related to leukemia to find those having the best accuracy and the least time complexity, in order to make diagnosis faster, easier, and more accurate. This chapter uses the Konstanz Information Miner (KNIME) platform and other relevant software to implement and compare various algorithms to find the best one. Our future work will deal with analysis of leukemia patients in and around a specific area. We will specify the area having the greatest number of leukemia patients, to assist in planning, since providing a plan for the diagnosis and treatment of cancers is a key component of any overall cancer control plan. Providing doctors, equipment, and appropriate medication where it is most required, instead of distributing these resources randomly, is an important factor.

The chapter objective is to achieve better prediction of leukemia, which is a serious disease that can be cured if treated in earlier stages. The analysis of blood samples is typically done manually to determine if there are any abnormalities in the sample that are indicative of disorders. It is very beneficial for patients to be diagnosed at earlier stages, so they have a possibility of being cured. The mortality rate in India can be reduced to a certain extent if people with leukemia are treated earlier, so their disease does not prove to be fatal. If an efficient technique can be developed for the prediction of leukemia, then it will be easier for physicians to diagnose it.

The rest of the chapter is arranged as follows: motivation is stated in Section 2, a literature review is elaborated in Section 3, a description of the proposed system is provided in Section 4, simulation results and a discussion are given in Section 5, and conclusions and future directions are discussed in Section 6.

2: Motivation

Leukemia is a type of cancer that affects the bone marrow and is considered to be fatal. In spite of advancements in science and technology, microscopic examination of a blood smear remains the standard and most economical method for leukemia diagnosis. Manual examination depends on the pathologist, that is, on their experience, mental state, personal circumstances, and so on, and all of these factors can influence the results. There is therefore a need for a viable computerized framework for leukemia screening that yields significantly improved results. Moreover, computerized systems, when contrasted with manual analysis, can increase the precision and the speed of diagnosis. This will assist specialists in treating the disease.

3: Literature review

The authors have carried out a rigorous analysis and study of many research articles based on leukemia with particular focus on classification and clustering algorithms.

Because clustering and classification techniques are now used across medical fields to obtain better outcomes, this chapter emphasizes these techniques. A group of researchers (Choubey and Paul, 2015; Choubey and Paul, 2016a, b; Choubey and Paul, 2017a, b; Choubey et al., 2017; Choubey et al., 2018; Choubey et al., 2019a, 2019b; Choubey et al., 2020a; Kumar et al., 2020a) have implemented many soft computing and computational intelligence methods for the prediction of diabetes. These researchers compared their proposed algorithms with several existing algorithms on real-world diabetes datasets, evaluated the performance of each algorithm, and discussed future directions. Similarly, Sharma et al. (2020) discussed computational intelligence techniques for the identification of breast cancer; Parthvi et al. (2020) carried out a comparative analysis using machine learning and data-mining techniques for leukemia; Pahari and Choubey (2020) carried out a comparative analysis using soft computing approaches for leukemia; Kumar et al. (2018b) and Kumar et al. (2020b) used multichannel FLANN and cat swarm optimization-based FLANN to eliminate noise from ultrasound images; Srivastava and Choubey (2020), Kumar et al. (2019), and Srivastava and Choubey (2019) used, analyzed, and compared machine-learning, data-mining, and soft computing techniques for the classification of heart disease; and Bala et al. (2017) and Bala et al. (2018) analyzed and compared soft computing, data mining, and machine-learning techniques for the prediction of thunderstorms and lightning.

Dash et al. (2012) provided a comparison between dimensional reduction techniques like the hybrid feature selection scheme and partial least squares method. In this analysis, the relative performance of four different supervised classification procedures, including radial basis function network (RBFN), was evaluated. The results presented in the paper showed that the appropriate feature selection method was a partial least squares regression method, and a combined use of different classification and feature selection approaches made it possible to construct high-performance classification models for microarray data.

Chandrasekar et al. (2013) presented an effective classification method. After analyzing different classification algorithms, they chose six classifiers based on simulation performance, and the results showed that the random tree classifier algorithm achieved an overall classification accuracy of 98%.

Priyanga and Prakasam (2013) proposed a system called a data mining-based cancer prediction system. The main aim of this model is to give earlier warnings to patients, and it is also of both time and cost benefit to the user. This model predicts specific cancer risk. The system was validated by comparing the patient’s prior medical records with the predicted result given by the model, and also this system was analyzed using the WEKA tool. This prediction system is available online.

Suji and Rajagopalan (2013) used the oral datasets of many cancer and noncancer patients; the collected data was preprocessed for duplicate and missing data. Then various classification algorithms were applied on this preprocessed dataset. The performance of all the algorithms was then analyzed. The obtained result clearly showed that for the C4.5 algorithm, the classification rate reached almost 100%, while the classification rate of the random tree algorithm and MPNN was near 98.7% and 99.5%, respectively.

Sivaraman et al. (2014) proposed a blood cancer prediction system by using a statistical approach with a fuzzy inference system and a feed-forward back-propagation neural network. Their system was implemented on a huge set of test data, and was utilized to analyze the outcomes. The proposed blood cancer prediagnosis system offered significant accuracy, sensitivity, and specificity.

Shouval et al. (2015) proposed a machine-learning algorithm that is part of the data mining (DM) approach, which may serve for transplantation-related mortality risk prediction. In the case of acute lymphocytic leukemia (ALL), the alternate decision tree model provides a robust tool for risk evaluation of patients with this disease. This method has proved useful for clinical prediction in hematopoietic stem-cell transplantation.

Daqqa et al. (2017) presented a study that predicted the existence of leukemia by determining the relationship of blood properties and leukemia to gender, health status of the patient, and age, using three data mining classifiers for blood cancer: k-nearest neighbor (k-NN), decision tree (DT), and support vector machine (SVM). The study was performed on a dataset of about 4000 patients, and the results showed that the decision tree algorithm had the highest percentage in comparison with the other two algorithms. The study also indicates that the DT classifier captures information about other attributes, such as the cities (eastern regions) that are most vulnerable to leukemia.

Kumar et al. (2018a), using Python as a key tool together with k-nearest neighbor (k-NN) and naïve Bayes classifiers, reported acceptable performance for the classification of acute leukemia from acquired microscopic test images.

Panda and Vihar (2016) have used bioinformatics datasets for understanding the effectiveness of a proposed classification, concluding the effectiveness of the proposed approach of combining DCNN by comparing it with an FRF classifier and with the other available research in the relevant domain; they highlight the future scope of the research in their conclusions.

Vasighizaker et al. (2019) used a one-class classification support vector machine (OCSVM) method to classify an acute myeloid leukemia (AML) cancer dataset. The researchers have claimed that, compared with the traditional methods, their proposed method’s experimental results indicate superiority.

Warnat-Herresthal et al. (2020) proposed a data-driven, high-dimensional approach to the prediction of leukemia. The approaches used in the study are highly scalable with low marginal cost, essentially matching human expert annotation in a near-automated workflow. The results of the study show that a machine-learning approach with transcriptomics can be used as part of an integrated omics approach in which, for the risk prediction of leukemia, differential diagnoses are achieved by genomics while the diagnosis itself is assisted by transcriptomic-based machine learning.

Table 1 provides a thorough analysis of different research articles. In the table we have presented the different techniques, datasets, tools used, advantages, issues, and accuracy for cancer diseases.

Table 1

Summary of existing works concerning leukemia.
Authors with year | Datasets | Techniques used | Tool used | Advantages | Issues | Accuracy
(Kumar et al., 2018a) | Acquired digital data: Microscopic test images | K-nearest neighbor (k-NN) and naïve Bayes classifier | Python | The outcomes show that the calculation proposed accomplishes a worthy exhibition for the analysis of intense lymphocytic leukemia | Absence of forecast model improvement rules, I clung to an exacting methodologic head | 80%
(Escalante et al., 2012) | Real data such as ALL/AML dataset | Two Bayesian classification methods, which incorporate feature selection, for the classification of gene expression data derived from cDNA microarrays | - | EPSMS is an exceptionally powerful strategy for the computerized development of troupe classifiers for acute leukemia, which requires no noteworthy client mediation | There are still some open issues that call for further examination. One issue is the determination of a lot of classifiers for making a gathering | 97.68%
(Shouval et al., 2015) | Source not provided. | The alternating decision tree (ADT) algorithm | - | The substituting choice tree model provides a strong instrument for the chance assessment of patients with AL before HSCT | Absence of expectation model advancement guidelines, I clung to exacting methodologic principals, as opposed to the EBMT and HCT-CI scores | 70%
(Li et al., 2016) | The proposed method was tested on 130 ALL images taken from ALL IDB | Complete methodology is based on the dual-threshold algorithm | - | Proposed a double limit strategy for segmenting white blood cells from acute lymphoblastic leukemia images | White blood cell division, which assumes a significant job in programmed cell morphology investigation remains a difficult issue in view of the morphological variety of WBCs and the mind-boggling foundation of blood tiny pictures | 97.85%
(Sewak et al., 2009) | ALL and AML using the microarray gene expression data | The heuristic nature of machine-learning algorithms | - | The advisory group, through a lion’s share casting a ballot, effectively ordered an aggregate of 34 of the 35 approval informational collections, yielding an exactness of 97.14% for the three-class characterization issue | Absence of various kinds of informational indexes utilized | 97%
(Abdeldaim et al., 2018) | ALL-IDB1 and ALL-IDB2 | K-NN is used for classification | Python | Acceptable in terms of the segmentation performance as the accuracy of all classifiers, especially k-NN, which achieved the best accuracy | The shape features cannot be trusted because of sensitivity to segmentation errors. These features integrate together with regional features, which are less susceptible to errors | 91%
(Valdés and Barton, 2004) | The dataset utilized has 7129 qualities where patients are isolated into a preparation set containing 38 bone marrow tests | K-means algorithm, with a Boolean reasoning algorithm | - | Representation additionally clarified the conduct of the neural system models and recommends the potential for the presence of better arrangements | The outcomes clarify the conduct of the neural system models and propose the potential for the presence of better arrangements | 95%
(Do and Byrd, 2015) | The outcomes from mass cytometry were contrasted and clinical stream cytometry information, and the techniques Wrath profoundly reliable | K-means algorithm | Python | A bewildering amount of data was managed from one lot of bone marrow suctions | An astonishing amount of information was gathered and can be used for future development | 96%
(Panda and Vihar, 2016) | Uses bioinformatics datasets for understanding the effectiveness of our proposed classification. It uses arrhythmia, leukemia, lymphoma, and prostate cancer datasets for experimentation | The goal is to develop an efficient machine learning algorithm that can help to speed up the classification process and address the memory constraints effectively | Python | We conclude with the effectiveness of our proposed approach of combining DCNN with FRF classifier compared with other available research in the relevant domain and highlight the future scope of research | Lack of different type of datasets used | 93.7%
(Shafique and Tehsin, 2018) | The dataset used is from Alex-Net | Zack algorithm is used for each segmentation of leukocytes, SVM classifier | - | Robotized diagnosing framework may assist in early diagnosing of leukemia so it is very well may be dealt with viably | The speed and working model can be expanded | 99.5%
(Fuse et al., 2019) | Niigata Group, Nagaoka Group dataset | The alternating decision tree (ADTree) algorithm is used for this, one component of the machine-learning (ML) approach based on artificial intelligence (AI) | - | The current outcomes demonstrate that ML, for example, ADTree, will add to the decision-making procedure in the expanded allo-HSCT field and be valuable for forestalling leukemia relapse | The drawback of the current examination is that the volume of patient information for ADTree to learn was generally small | 90%
(Wang et al., 2005) | Microarray dataset (Source not provided) | Feature selection algorithm | - | These AI calculations are actualized in IKA, a freely accessible open-source programming bundle. This product can be utilized by both experienced and inexperienced clients | Because of their high computational costs, it is difficult to consolidate wrappers with some AI calculations, for example, SMO | 85%
(Choudhury et al., 2013) | Source not provided | Classification, Clustering algorithms, Soft computing techniques | - | It demonstrates that all public gene expression data are potentially useful for drug discovery | Lack of different types of datasets used | 90%
(Hassane et al., 2008) | The microarray gene expression data was obtained from the National Centre for Biotechnology Information (NCBI) Gene Expression Omnibus (GEO) | Isolation of CD34 AML total RNA, Microarray hybridization and analysis, Acquisition and processing of public microarray data, Query of the GEO data | - | It demonstrates that data are potentially useful for drug discovery and can be accessed by any investigator with the appropriate computational tools | Speed of prediction and visualization is too slow | 90%


The results presented in the study show that, with existing technologies, it is potentially possible to achieve good performance in a near-automated fashion.

4: Description of proposed system

The proposed system is described in the following subsections.

4.1: Introduction and related concepts

Leukemia is a cancerous growth of abnormal white cells that damages the blood and bone marrow. Leukemia is classified by the type of white blood cells influenced and by how rapidly the illness advances. Lymphocytic leukemia (otherwise called lymphoid or lymphoblastic leukemia) appears in the white blood cells, called lymphocytes, in the bone marrow. Myeloid (otherwise called myelogenous) leukemia may likewise begin in blood cells other than lymphocytes, such as red blood cells and platelets.

Based on how rapidly it appears or advances, leukemia is called either acute (quickly developing) or chronic (slowly developing). Acute leukemia advances quickly and brings about the accumulation of immature, nonfunctioning blood cells in the bone marrow. With this sort of leukemia, cells reproduce and build up in the marrow, diminishing the marrow’s capacity to produce enough functional blood cells.

The diagnosis of leukemia typically relies on the complete blood count (CBC), in which physicians check the counts of white blood cells, red blood cells, and platelets. This test can reveal leukemia cells, but it is often not sufficient for physicians to confirm that the patient has leukemia. Other techniques are therefore used, including bone marrow aspiration and microscopic examination of blood smears. However, all these manual methods require much effort and time, and extensively trained medical experts are needed to carry out this type of inspection. By contrast, computerized diagnostic systems can address these issues of manual analysis. In addition, they can lessen the need for medical experts and can give accurate and reliable outcomes compared with manual diagnosis (Table 2).

Table 2

Descriptions of acute myeloid leukemia dataset.
Number of features | Name of features/attributes | Number of instances/samples | Number of classes
7 | 1. Subject identifier (id) | 646 | 2
  | 2. Treatment arm A or B (trt) | |
  | 3. Time to death or last follow-up (futime) | |
  | 4. 1 if futime is a death, 0 for censoring (death) | |
  | 5. Time to hematopoietic stem cell transplant (txtime) | |
  | 6. Time to complete response (crtime) | |
  | 7. Time to relapse of disease (rltime) | |


An acute myeloid leukemia dataset has been used for the analysis. The myeloid dataset is available from Picostat (2018) and is also distributed in an R package. This dataset includes seven features, including the class, and 646 instances or samples. The death feature is the class (two values), where 1 indicates that futime is a death and 0 indicates censoring.

4.2: Framework for the proposed system

The authors have used the KNIME platform in implementing this work. It is an open-source platform that has many functionalities, including data mining, statistics, etc. It also helps in analyzing the results by plotting line graphs, bar graphs, etc. Many of its functions are accessed by dragging and dropping them onto the workspace or by double clicking to select them. Using KNIME, we configured each function and executed it. First, we selected the dataset by browsing files on the system. Alternatively, .csv files can be imported directly into the software and then executed. Next, the Partitioning tool was used to partition the data in a 70:30 ratio for classification methods like SVM and k-NN. We also applied clustering algorithms like k-means and fuzzy c-means, for which data partitioning was not needed.
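The 70:30 partitioning step can be illustrated outside KNIME with a few lines of Python. This is only a sketch of the idea, not the KNIME Partitioning tool itself: the function name `partition`, the fixed random seed, and the assumption that the larger 70% share is the training portion are ours and are not stated in the workflow description.

```python
import random

def partition(rows, train_fraction=0.7, seed=0):
    """Shuffle the rows and split them into (train, test) by train_fraction.

    Assumption: the 70% share is used for training and the 30% share for testing.
    """
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)   # reproducible shuffle
    cut = int(round(train_fraction * len(shuffled)))
    return shuffled[:cut], shuffled[cut:]

# Stand-in for the 646 samples of the myeloid dataset (row indices only).
rows = list(range(646))
train, test = partition(rows)
print(len(train), len(test))
```

Every sample ends up in exactly one of the two parts, so the test portion is never seen while the classifier is being fitted.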

The proposed work consists of two phases: Phase I deals with the collection and visualization of datasets and Phase II deals with machine learning and data mining techniques for the prediction of leukemia.

Fig. 1 shows that the dataset was collected online. First, we start the software and import the collected dataset. The dataset must be partitioned before the classification algorithms are applied, whereas the clustering algorithms do not require partitioning. In this work, SVM and k-NN were used as classification algorithms, whereas k-means and fuzzy c-means were used as clustering algorithms. Finally, the selected algorithm predicts leukemia and the workflow is finished.

Fig. 1 Block diagram of proposed methodology.

The classification algorithms are briefly explained in the following paragraphs (Fig. 2).

Fig. 2 Basic architecture of linear SVM.

4.2.1: Support vector machine

Vladimir N. Vapnik and Alexey Ya. Chervonenkis invented the SVM algorithm in 1963. The current standard incarnation (soft margin) was proposed by Corinna Cortes and Vapnik in 1993 and published in 1995. The objective of the SVM algorithm is to find the hyperplane that best separates the points of different classes in the feature space. The nearest instances on either side of the boundary are called support vectors. The basic architecture of linear SVM is shown in Fig. 2.

The algorithms for SVM are noted as:

Algorithm

  1. SVM finds the widest street (margin) between the nearest points on either side. It may not always be possible to obtain a hard decision boundary that clearly segregates the points.
  2. A hard margin classifier is sensitive to outliers, that is, one or more data points that lie on the wrong side of the classification boundary.
  3. Soft margin classifiers allow some violations of the decision boundary.
  4. A suitable transformation of the data, known as the kernel trick, resolves many such cases.
  5. The decision boundary equation can be written as w1x1 + w2x2 + b = 0, where w1, w2, and b are model parameters.
  6. The training process of SVM classification determines the values of w1, w2, and b. For any new sample, the values of x1 and x2 are known.
  7. The decision plane separates points based on whether w1x1 + w2x2 + b is greater than or equal to 0, or less than 0.
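The role of the parameters w1, w2, and b in the steps above can be sketched in Python. The code below does not train an SVM; it only evaluates a linear decision rule and the soft-margin (hinge) penalty for given parameters. The parameter values and the function names are hypothetical and chosen purely for illustration.

```python
def decision_value(w, b, x):
    """Return w1*x1 + w2*x2 + ... + b for one sample x."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def classify(w, b, x):
    """Assign class +1 when the decision value is >= 0, otherwise class -1."""
    return 1 if decision_value(w, b, x) >= 0 else -1

def hinge_loss(w, b, x, y):
    """Soft-margin penalty for a sample x with true label y in {-1, +1}.

    The penalty is zero when the sample is on the correct side of the
    boundary with a margin of at least 1, and grows linearly otherwise.
    """
    return max(0.0, 1.0 - y * decision_value(w, b, x))

# Hypothetical parameters, as if produced by a training run.
w = [2.0, -1.0]
b = -1.0

print(classify(w, b, [2.0, 1.0]))          # decision value  2.0
print(classify(w, b, [0.0, 1.0]))          # decision value -2.0
print(hinge_loss(w, b, [1.0, 0.5], 1))     # inside the margin
```

A soft margin classifier tolerates samples with a nonzero hinge penalty, whereas a hard margin classifier requires the penalty to be zero for every training sample.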

4.2.2: K-nearest neighbor

K-NN was introduced by Fix and Hodges in 1951. K-NN is a simple, powerful, nonparametric, lazy learning method utilized for classification. In the beginning of the 1970s, k-NN was being used in statistical estimation and pattern recognition. The same algorithm was used by Choubey et al. (2020b) for the classification of diabetes (Fig. 3).

Fig. 3 Flowchart for K-nearest neighbor algorithm.

Algorithm

Let m be the number of training data samples. Let p be an unknown point.

  1. Store the training samples in an array of data points arr[]. Each element of this array represents a tuple (x, y).
  2. for i = 0 to m − 1:
    Calculate the Euclidean distance d(arr[i], p).
  3. Make a set S of the K smallest distances obtained. Each of these distances corresponds to an already classified data point.
  4. Return the majority label among S.
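The steps above can be sketched in Python as follows. This is a minimal illustration rather than the KNIME implementation used in this chapter: the function name `knn_predict`, the representation of each training sample as a (features, label) pair, and the toy data are assumptions made for the example.

```python
import math
from collections import Counter

def knn_predict(train, p, k=3):
    """Predict the label of point p from the k nearest training samples.

    train is a list of (features, label) pairs; p is a feature tuple.
    """
    # Sort the training samples by Euclidean distance to p.
    by_distance = sorted(train, key=lambda item: math.dist(item[0], p))
    # Keep the labels of the k closest samples and return the majority label.
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]

# Toy two-class data (hypothetical values, not taken from the myeloid dataset).
train = [
    ((1.0, 1.0), 0), ((1.2, 0.8), 0), ((0.9, 1.1), 0),
    ((5.0, 5.0), 1), ((5.2, 4.9), 1), ((4.8, 5.1), 1),
]
print(knn_predict(train, (1.1, 1.0)))
print(knn_predict(train, (5.0, 5.1)))
```

Because k-NN is a lazy learner, no model is fitted in advance; all distance computations happen at prediction time. Choosing an odd K for a two-class problem avoids ties in the majority vote.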

The clustering algorithms are noted as:

4.2.3: K-means clustering

K-means clustering is an algorithm used to group the objects based on features into K number of groups. It works on an unlabeled dataset (unsupervised machine learning) (Fig. 4).

Fig. 4 Flowchart for K-means algorithm.

K-means will split a dataset into K clusters:

  •  Where each observation in Ki is as similar to the others in that cluster as possible.
  •  Where the data in Ki is as different as possible from the data in the other clusters.

K-means clustering is an exploratory data analysis technique. The algorithms for k-means clustering are noted as:

Algorithm

Step 1. Choose K initial mean values (cluster centers) at random.

Step 2. Assign each data point to the cluster whose mean is nearest.

Step 3. Recompute the mean of each cluster and repeat Step 2 until the means no longer change.
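A common reading of the steps above is: pick K random initial means, assign every point to its nearest mean, recompute the means, and repeat until they stop changing. The Python sketch below follows that reading. The function name `kmeans`, the choice of initial means by sampling K data points, and the toy data are our assumptions for illustration; the results in this chapter were produced with KNIME.

```python
import math
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Group points (tuples of floats) into k clusters; return (means, clusters)."""
    rng = random.Random(seed)
    means = rng.sample(points, k)                 # Step 1: random initial means
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for p in points:                          # Step 2: assign to nearest mean
            nearest = min(range(k), key=lambda j: math.dist(p, means[j]))
            clusters[nearest].append(p)
        new_means = []
        for j, members in enumerate(clusters):    # Step 3: recompute each mean
            if members:
                new_means.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_means.append(means[j])        # keep the mean of an empty cluster
        if new_means == means:                    # stop when nothing changes
            break
        means = new_means
    return means, clusters

# Toy data with two well-separated groups (hypothetical values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
means, clusters = kmeans(points, k=2)
print(sorted(len(c) for c in clusters))
```

The result depends on the random initial means, so in practice the algorithm is often run several times and the best grouping is kept.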

4.2.4: Fuzzy c-means clustering

Dunn created the fuzzy c-means clustering method in 1973 and it was improved by Bezdek in 1981. It is commonly used in pattern recognition. It is based on minimization of the following objective function:

$$J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^{m} \left\| x_i - c_j \right\|^2, \qquad 1 \le m < \infty$$

where $m$ is any real number greater than 1, $u_{ij}$ is the degree of membership of $x_i$ in the cluster $j$, $x_i$ is the $i$th of the $d$-dimensional measured data, $c_j$ is the $d$-dimensional center of the cluster, and $\|\cdot\|$ is any norm expressing the similarity between any measured data and the center. Fuzzy partitioning is carried out through an iterative optimization of the objective function shown previously, with the update of the memberships $u_{ij}$ and the cluster centers $c_j$ by:

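$$u_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \dfrac{\left\| x_i - c_j \right\|}{\left\| x_i - c_k \right\|} \right)^{\frac{2}{m-1}}}$$

where

$$c_j = \frac{\sum_{i=1}^{N} u_{ij}^{m} \, x_i}{\sum_{i=1}^{N} u_{ij}^{m}}$$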

This iteration will stop when $\max_{ij} \left| u_{ij}^{(k+1)} - u_{ij}^{(k)} \right| < \varepsilon$, where $\varepsilon$ is a termination criterion between 0 and 1 and $k$ is the iteration step. This procedure converges to a local minimum or a saddle point of $J_m$.
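The alternating update of cluster centers and memberships, together with the stopping rule on the largest membership change, can be sketched in Python as follows. This is an illustrative implementation under our own assumptions (Euclidean norm, random initial memberships, fuzzifier m = 2, and the hypothetical function name `fuzzy_c_means`); it is not the KNIME node used for the results in this chapter.

```python
import math
import random

def fuzzy_c_means(points, c, m=2.0, eps=1e-5, max_iter=300, seed=0):
    """Return (centers, memberships) for points given as tuples of floats."""
    rng = random.Random(seed)
    n, dim = len(points), len(points[0])
    # Random initial memberships; each row is normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(c)]
        total = sum(row)
        u.append([v / total for v in row])
    for _ in range(max_iter):
        # Center update: weighted mean of the points with weights u_ij^m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append(tuple(
                sum(wt * p[d] for wt, p in zip(weights, points)) / total
                for d in range(dim)))
        # Membership update from the distances to the new centers.
        new_u = []
        for p in points:
            dists = [math.dist(p, cj) for cj in centers]
            if 0.0 in dists:
                # The point coincides with a center: full membership there.
                hits = [1.0 if d == 0.0 else 0.0 for d in dists]
                new_u.append([h / sum(hits) for h in hits])
            else:
                new_u.append([
                    1.0 / sum((dj / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                    for dj in dists])
        # Stop when the largest change in any membership is below eps.
        change = max(abs(a - b) for ra, rb in zip(new_u, u) for a, b in zip(ra, rb))
        u = new_u
        if change < eps:
            break
    return centers, u

# Toy data with two well-separated groups (hypothetical values).
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centers, u = fuzzy_c_means(points, c=2)
hard_labels = [row.index(max(row)) for row in u]
print(hard_labels)
```

Unlike k-means, every point keeps a degree of membership in every cluster; a hard assignment is obtained only at the end by taking the cluster with the largest membership.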

5: Simulation results and discussion

Every study must start with accurate data analysis. The myeloid dataset was used for the analysis, which is also available in R packages.

The performance evaluations of the classification algorithms used for leukemia are given in Table 3.

Table 3

Performance evaluation of classification algorithms for leukemia.
Dataset | Algorithm | Sensitivity/recall | Precision | F-measure
Myeloid dataset | SVM | 0.9746 | 0.9746 | 0.9746
Myeloid dataset | K-NN | 0.9570 | 0.9570 | 0.9570


In Table 3, it may be observed that the SVM classifier performs better than the k-NN classifier on all three measures.
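For reference, the three measures reported for the classifiers are related as follows: precision is TP / (TP + FP), sensitivity (recall) is TP / (TP + FN), and the F-measure is the harmonic mean of the two. The short Python sketch below computes them; the confusion-matrix counts in the example are hypothetical and are not the counts behind Table 3.

```python
def f_measure(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

def precision_recall_f(tp, fp, fn):
    """Compute precision, recall, and F-measure from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall, f_measure(precision, recall)

# Hypothetical counts: 90 true positives, 10 false positives, 30 false negatives.
p, r, f = precision_recall_f(tp=90, fp=10, fn=30)
print(round(p, 4), round(r, 4), round(f, 4))

# When precision and recall are equal, the F-measure equals that common value.
print(round(f_measure(0.9746, 0.9746), 4))
```

The last line illustrates why the three columns of a row coincide whenever precision and recall are equal.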

The performance evaluations of the clustering algorithms used for leukemia are given in Table 4.

Table 4

Performance comparison of clustering algorithms for leukemia.
Dataset | Algorithm | F-measure
Myeloid dataset | K-means | 0.819
Myeloid dataset | Fuzzy c-means | 0.829


In Table 4, it may be observed that fuzzy c-means performed better than the k-means algorithm.

The analysis of the clustering and classification methods is shown in the following figures.

Fig. 5 shows estimates of victims.

Fig. 5 Different graphs showing estimates of victims.

In Fig. 5, the number of patients in different states across the country is clearly visible. Different graphs have also been created to give an idea of the demographics of leukemia.

Fig. 6 indicates the spread of leukemia across different states in the United States.

Fig. 6 United States map showing leukemia spread across different states.

Fig. 7 shows the overall summary of the analysis of leukemia.

Fig. 7 Overall summary of leukemia analysis.

Fig. 7 depicts the overall summary of the data analysis report; the number of doctors required, the average rate at which people are affected, and other measures can be read from the figure.

Fig. 8 indicates the pivot table for the myeloid dataset.

Fig. 8 Pivot table for myeloid dataset.

In Fig. 9, the basic workflow of k-means and fuzzy c-means clustering methods is clearly visible.

Fig. 9 Basic work flow diagram.

In Figs. 10 and 11, we may clearly see k-means clustering results with different numbers of clusters.

Fig. 10 K-means clustering plot 1.
Fig. 11 K-means clustering plot 2.

Fig. 12 represents the clusters formed by the fuzzy c-means method.

Fig. 12 Fuzzy c-means clustering for six clusters.

In Fig. 12, six clusters have been formed and plotted with respect to “futime.” From this figure we can see that the clusters form wave-like patterns.

6: Conclusion and future directions

We have applied various algorithms to determine which one is most efficient in the diagnosis of leukemia. Results are shown in both graphical and tabular forms for the various algorithms that have been applied to leukemia. KNIME was used to apply the algorithms and obtain results for the classification and clustering methods for leukemia. The performance is different in each case, and we have tried to find the most efficient algorithm for the diagnosis of leukemia. We have also studied the performance of various existing algorithms that have been used previously by fellow researchers. In this chapter we have utilized SVM and k-NN for classification and k-means and fuzzy c-means for clustering. Both the classification and clustering methods have been used in the prediction of leukemia.

The future directions for researchers are to deploy different deep-learning architectures for prediction of leukemia and compare these architectures to find those that perform best. We may also deploy deep-learning architectures for larger samples of datasets. Another future direction for researchers is the design of an automated detection system for leukemia blood cancer.

References

Abdeldaim A.M., Sahlol A.T., Elhoseny M., Hassanien A.E. Computer-aided acute lymphoblastic leukemia diagnosis system based on image analysis. Stud. Comput. Intell. 2018;730:131–147. doi:10.1007/978-3-319-63754-9_7.

Bala K., Choubey D.K., Paul S. Soft computing and data mining techniques for thunderstorms and lightning prediction: a survey. In: Proceedings of the International Conference on Electronics, Communication and Aerospace Technology, ICECA 2017, 2017-Janua; 2017:doi:10.1109/ICECA.2017.8203729.

Bala K., Choubey D.K., Paul S., Lala M.G.N. Classification techniques for thunderstorms and lightning prediction: a survey. In: Soft-Computing-Based Nonlinear Control Systems Design. IGI Global; 2018:1–17.

Chandrasekar R.M., Palaniammal V., Phil M. Performance and evaluation of data mining techniques in cancer diagnosis. IOSR J. Comput. Eng. 2013;15(5):39–44.

Choubey D.K., Paul S. GA_J48graft DT: a hybrid intelligent system for diabetes disease diagnosis. Int. J. Biosci. Biotechnol. 2015;7(5):135–150.

Choubey D.K., Paul S. GA_MLP NN: A Hybrid Intelligent System for Diabetes Disease Diagnosis. 2016a.49–59. doi:10.5815/ijisa.2016.01.06.

Choubey D.K., Paul S. Classification techniques for diagnosis of diabetes: a review. Int. J. Biomed. Eng. Technol. 2016b;21(1):doi:10.1504/IJBET.2016.076730.

Choubey D.K., Paul S. GA_SVM: a classification system for diagnosis of diabetes. In: Handbook of Research on Soft Computing and Nature-Inspired Algorithms. 2017a:doi:10.4018/978-1-5225-2128-0.ch012.

Choubey D.K., Paul S. GA-RBF NN: a classification system for diabetes. Int. J. Biomed. Eng. Technol. 2017b;23(1):71–93. doi:10.1504/IJBET.2017.082229.

Choubey D.K., Paul S., Dhandhenia V.K.Rule based diagnosis system for diabetes. Biomed. Res. 2017;28(12).

Choubey D.K., Paul S., Shandilya S., Dhandhania V.K.Implementation and analysis of classification algorithms for diabetes. Curr. Med. Imaging Rev. 2018;14:340–354. doi:10.2174/1573405614666180828115813.

Choubey D.K., Paul S., Bala K., Kumar M., Singh U.P. Implementation of a hybrid classification method for diabetes. In: Intelligent Innovations in Multimedia Data Engineering and Management. IGI Global; 2019a:201–240.

Choubey D.K., Tripathi S., Kumar P., Shukla V., Dhandhania V.K. Classification of diabetes by kernel based SVM with PSO. Recent Pat. Comput. Sci. 2019b;1–14.

Choubey D.K., Kumar M., Shukla V., Tripathi S., Dhandhania V.K.Comparative analysis of classification methods with PCA and LDA for diabetes. Curr. Diabetes Rev. 2020a;16:doi:10.2174/1573399816666200123124008.

Choubey D.K., Kumar P., Tripathi S., Kumar S.Performance evaluation of classification methods with PCA and PSO for diabetes. Netw. Model. Anal. Health Inform. Bioinform. 2020b;9(1):1–17. doi:10.1007/s13721-019-0210-8.

Choudhury T., Kumar V., Nigam D.Cancer research through the help of soft computing techniques: a survey. Int. J. Comput. Sci. Mob. Comput. 2013;2(April):467–477.

Daqqa K.A.S.A., Maghari A.Y.A., Al Sarraj W.F.M.Prediction and diagnosis of leukemia using classification algorithms. In: ICIT 2017—8th International Conference on Information Technology, Proceedings, October; 2017:638–643. doi:10.1109/ICITECH.2017.8079919.

Dash S., Patra B., Tripathy B.K.A hybrid data mining technique for improving the classification accuracy of microarray data set. Int. J. Inf. Eng. Electron. Bus. 2012;4(2):43–50. doi:10.5815/ijieeb.2012.02.07.

Do P., Byrd J.C.Mass cytometry: a high-throughput platform to visualize the heterogeneity of acute myeloid leukemia. Cancer Discov. 2015;5(9):912–914. doi:10.1158/2159-8290.CD-15-0905.

Escalante H.J., Montes-y-Gómez M., González J.A., Gómez-Gil P., Altamirano L., Reyes C.A., Reta C., Rosales A.Acute leukemia classification by ensemble particle swarm model selection. Artif. Intell. Med. 2012;55(3):163–175. doi:10.1016/j.artmed.2012.03.005.

Fuse K., Uemura S., Tamura S., Suwabe T., Katagiri T., Tanaka T., Ushiki T., Shibasaki Y., Sato N., Yano T., Kuroha T., Hashimoto S., Furukawa T., Narita M., Sone H., Masuko M.Patient-based prediction algorithm of relapse after Allo-HSCT for acute leukemia and its usefulness in the decision-making process using a machine learning approach. Cancer Med. 2019;8(11):5058–5067. doi:10.1002/cam4.2401.

Hassane D.C., Guzman M.L., Corbett C., Li X., Abboud R., Young F., Liesveld J.L., Carroll M., Jordan C.T.Discovery of agents that eradicate leukemia stem cells using an in silico screen of public gene expression data. Blood. 2008;111(12):5654–5662. doi:10.1182/blood-2007-11-126003.

Kumar S., Mishra S., Asthana P., Pragya. Automated detection of acute leukemia using K-mean clustering algorithm. Adv. Intell. Syst. Comput. 2018a;554:655–670. doi:10.1007/978-981-10-3773-3_64.

Kumar M., Mishra S.K., Choubey S.K., Tripathy S.S., Choubey D.K., Das D.Cat swarm optimization based functional link multilayer perceptron for suppression of Gaussian and impulse noise from computed tomography images. Curr. Med. Imaging. 2018b;16(4):329–339. doi:10.2174/1573405614666180903115336.

Kumar S., Mohapatra U.M., Singh D., Choubey D.K.EAC: efficient associative classifier for classification. In: Proceedings—2019 International Conference on Applied Machine Learning, ICAML 2019; 2019:15–20. doi:10.1109/ICAML48257.2019.00011.

Kumar S., Bhusan B., Singh D., Choubey D.K.Classification of diabetes using deep learning. In: Proceedings of the 2020 IEEE International Conference on Communication and Signal Processing, ICCSP 2020, Dl; 2020a:651–655. doi:10.1109/ICCSP48568.2020.9182293.

Kumar M., Jangir S.K., Mishra S.K., Choubey S.K., Choubey D.K.Multi-Channel FLANN Adaptive Filter for Speckle & Impulse Noise Elimination from Color Doppler Ultrasound Images. 2020b.1–4. doi:10.1109/iconc345789.2020.9117288.

Li Y., Zhu R., Mi L., Cao Y., Yao D.Segmentation of white blood cell from acute lymphoblastic leukemia images using dual-threshold method. Comput. Math. Methods Med. 2016;2016:doi:10.1155/2016/9514707.

Pahari S., Choubey D.K.Analysis of liver disorder using classification techniques: a survey. In: International Conference on Emerging Trends in Information Technology and Engineering, Ic-ETITE 2020; 2020:1–4. doi:10.1109/ic-ETITE47903.2020.300.

Panda M., Vihar V. Towards the Effectiveness of Deep Convolutional Neural Network Based Fast Random Forest Classifier. 2016 ArXiv, abs/1609.0.

Parthvi A., Rawal K., Choubey D.K.A comparative study using machine learning and data mining approach for leukemia. In: Proceedings of the 2020 IEEE International Conference on Communication and Signal Processing, ICCSP 2020; 2020:672–677. doi:10.1109/ICCSP48568.2020.9182142.

Picostat.Leukemia, A. Myeloid. No Title Https://Www.Picostat.Com/Dataset/r-Dataset-Package-Survival-Myeloid. 2018.

Priyanga A., Prakasam S.Effectiveness of data mining-based cancer prediction system (DMBCPS). Int. J. Comput. Appl. 2013;83(10).

Sewak M.S., Reddy N.P., Duan Z.H.Gene expression based leukemia sub—classification using committee neural networks. Bioinf. Biol. Insights. 2009;3:89–98.

Shafique S., Tehsin S.Acute lymphoblastic leukemia detection and classification of its subtypes using pretrained deep convolutional neural networks. Technol. Cancer Res. Treat. 2018;17:1–7. doi:10.1177/1533033818802789.

Sharma D., Jain P., Choubey D.K.A comparative study of computational intelligence for identification of breast cancer. In: International Conference on Machine Learning, Image Processing, Network Security and Data Sciences; 2020:209–216.

Shouval R., Labopin M., Bondi O., Mishan-Shamay H., Shimoni A., Ciceri F., Esteve J., Giebel S., Gorin N.C., Schmid C., Polge E., Aljurf M., Kroger N., Craddock C., Bacigalupo A., Cornelissen J.J., Baron F., Unger R., Nagler A., Mohty M. Prediction of allogeneic hematopoietic stem-cell transplantation mortality 100 days after transplantation using a machine learning algorithm: a European group for blood and marrow transplantation acute leukemia working party retrospective data mining stud. J. Clin. Oncol. 2015;33(28):3144–3151. doi:10.1200/JCO.2014.59.1339.

Sivaraman A., Rajesh S.A., Lakshmi M.Optimistic diagnosis of acute leukemia based on human blood sample using feed forward back propagation neural network. Int. J. Innov. Res. Sci. Eng. Technol. 2014;3(3):1046–1049.

Srivastava K., Choubey D.K.Soft computing, data mining, and machine learning approaches in detection of heart disease: a review. In: International Conference on Hybrid Intelligent Systems; 2019:165–175.

Srivastava K., Choubey D.K.Heart disease prediction using machine learning and data mining. Int. J. Recent Technol. Eng. 2020;9(1):21–219. doi:10.35940/ijrte.f9199.059120.

Suji R.J., Rajagopalan S.P.An automatic oral cancer classification using data mining techniques. Int. J. Adv. Res. Comput. Commun. Eng. 2013;2(10):3759–3765.

Valdés J.J., Barton A.J.Gene discovery in leukemia revisited: a computational intelligence perspective. Lect. Notes Artif. Intell. 2004;3029:118–127. doi:10.1007/978-3-540-24677-0_13.

Vasighizaker A., Sharma A., Dehzangi A.A novel one-class classification approach to accurately predict disease-gene association in acute myeloid leukemia cancer. PLoS One. 2019;14(12):1–12. doi:10.1371/journal.pone.0226115.

Wang Y., Tetko I.V., Hall M.A., Frank E., Facius A., Mayer K.F.X., Mewes H.W.Gene selection from microarray data for cancer classification—a machine learning approach. Comput. Biol. Chem. 2005;29(1):37–46. doi:10.1016/j.compbiolchem.2004.11.001.

Warnat-Herresthal S., Perrakis K., Taschler B., Becker M., Baßler K., Beyer M., Günther P., Schulte-Schrepping J., Seep L., Klee K., Ulas T., Haferlach T., Mukherjee S., Schultze J.L.Scalable prediction of acute myeloid leukemia using high-dimensional machine learning and blood transcriptomics. IScience. 2020;23(1):doi:10.1016/j.isci.2019.100780.
