Chapter 21: An ensemble classifier approach for thyroid disease diagnosis using the AdaBoostM algorithm

Giuseppe Ciaburro    Department of Architecture and Industrial Design, Università degli Studi della Campania Luigi Vanvitelli, Aversa, Italy

Abstract

The use of information technology in medicine has been a reality for some time now, and the continuous progress taking place is reflected in increasingly advanced technologies applied to patient care. These technologies are used with the fundamental objective of improving the health and life expectancy of the world population. Ensemble learning methods provide more correct decision-making processes, at the expense of greater complexity and a loss of interpretability, compared to learning systems based on single hypotheses. Ensemble learning combines the predictions of collections of hypotheses to obtain greater performance efficiency. In this chapter, we will explore how to use ensemble methods for the diagnosis of thyroid disease. After analyzing the concepts behind the different Ensemble Learning methods, we will present a practical case in which we will use the AdaBoostM algorithm for the diagnosis of thyroid disorders.

Keywords

Ensemble learning; AdaBoostM algorithm; Big data; Deep learning; Machine learning; Combination methods; Healthcare analytics; Medical informatics; Healthcare applications

1: Introduction

Thanks to digital technologies we can extract knowledge from data by attributing intelligence to objects through a connection between them. This new way of thinking about data has revolutionized processes and services, providing a new key to reading information from various sectors, so much so that the world of health has also been affected by this revolution (Ward, 2013; Mosavi et al., 2019). The healthcare sector is characterized by a series of problems and critical issues that have often affected its efficiency, offering patients a poor-quality service at significant cost (Hassan et al., 2019). To tackle this problem, a clear transformation of the system was required, one that would make it possible to respond to demands for improvement in the service provided (Wan, 2006; Beam et al., 2020).

The use of Information Technology (IT) in healthcare products, services, and processes, accompanied by organizational changes and the development of new skills, has introduced significant improvements in efficiency and productivity in the healthcare sector, as well as greater economic and social value of health care (Wan et al., 2020). IT is applied to medicine in support of personal care: the person is at the center of the therapeutic, diagnostic, or preventive project and, as such, receives or requires medical, health, or social-care services (Dua et al., 2014; Liyanage et al., 2019).

Personal care, understood as the relationship between the person and the health system, has not changed. What has undergone a clear transformation is the way in which health care is provided, both in terms of performing a medical service and in terms of organizing related services. The introduction of Information Technology has made health care more convenient from an economic point of view, as it reduces waste and inefficiencies while involving citizens more closely (Van Calster et al., 2019). IT allows a substantial reduction in the consumption of resources, both for the healthcare professional and for the citizen-user. The increase in productivity derives from the reduction of medical errors, from the attenuation or elimination of unnecessary treatment, from the reduction of queues and waiting lists, from the limitation of citizens' travel within the territory, and from simplified access to patient data, which facilitates disease treatment (Siau and Shen, 2006; Kwak and Hui, 2019; Jewell et al., 2020).

A digital report without leaving home, paying for a specialist appointment without going to the office, or changing your doctor from the comfort of your PC are just some examples of improvements introduced by the new technologies. The use of IT allows operations to be tracked, guaranteeing respect for privacy and logging every access to clinical documents (Steil et al., 2019; Lanzing, 2019). The economic convenience introduced by the use of IT is closely connected with the increase in productivity and derives from the reduction of medical errors and from the mitigation or elimination of unnecessary treatment, through greater communication between the different healthcare institutions and between the professionals themselves (Gu et al., 2019; Iftikhar et al., 2019). Another saving is obtained by reducing and/or eliminating paper material (Usak et al., 2020).

In this work, we used ensemble methodologies to preventively diagnose endocrinologic disorders such as those related to the thyroid. The data used as input derive from tests carried out on a population sample, with the collection of numerous indicators. The term ensemble refers to a set of basic learning machines whose predictions are combined to improve the overall performance. Ensemble methods can be divided into two categories: generative and nongenerative. The nongenerative ones try to combine in the best possible way the predictions made by the machines, while the generative ones generate new sets of learners, so as to create differences between them that can improve the overall performance. An ensemble method is a technique that combines predictions from multiple machine-learning algorithms to make predictions more accurate than any single model. By using multiple methods in modeling, prediction skills are improved, as each contribution seeks to compensate for the weaknesses of the others.

The first part of the chapter introduces the basics of techniques based on Machine Learning, with particular attention to the methods used in the field of Medical Informatics. Subsequently, the main algorithms based on Ensemble Learning are treated in detail, followed by a rich review of the main works that have used a Machine Learning-based model for disease diagnosis. Finally, a specific case is treated: predicting thyroid disease using ensemble learning.

2: Data analytics

Health care supported by digital technologies has generated in recent years a large volume of useful data on the clinical history of patients, on the treatment plans they have undergone, on the costs that the applied therapies have produced, and finally on the insurance coverage the patients enjoyed. This amount of data has attracted the attention of data analysis experts, who have been concerned with developing methodologies to analyze it (Zikopoulos and Eaton, 2011).

Data Analytics comprises tools based on statistical inference that examine in depth the raw data and knowledge available to identify correlations and trends or to verify existing theories and models. These tools answer precise questions starting from hypotheses formulated at the outset, focusing on specific sectors, with the aim of deriving best practices that lead to an improvement of the system. They make it possible to produce future forecasts and what-if simulations to verify the effects of certain changes on the system, as well as to obtain more detailed and in-depth analyses. Using more sophisticated techniques and tools, they can analyze much larger datasets. Analysis tools help analysts turn data into knowledge. The ability to analyze a large amount of information, often unstructured, represents a clear source of competitive advantage and differentiation. Big Data, combined with sophisticated data analysis, has the potential to provide researchers with unprecedented insights into patient behavior and health system conditions, enabling decisions to be made faster and more effectively (Raghupathi and Raghupathi, 2014).

Data analysis examines data with the aim of extracting knowledge from information. In emergency assistance, data analysis helps emergency teams to efficiently sift raw data, message traffic, and news feeds from the Internet to define instantly where and when a health emergency is occurring. In preventive care, data analysis identifies outbreaks and trends and prepares health specialists for the challenges they will face in the future (Kambatla et al., 2014).

Medical research also benefits greatly from data analysis. The ability to collect research, filter results, and stay abreast of the latest research-based best practices helps teams collaborate, improve test methods, and successfully apply for grants based on updated needs and information. Data analysis goes beyond simple hypothesis formulation: it is a determination of future events based on current facts and trends. Data analysis can be divided into two different branches: exploratory data analysis and confirmatory data analysis (Palanisamy and Thirunavukarasu, 2019). Exploratory data analysis, also known as EDA, is used to determine new trends in a market or sector. Confirmatory data analysis, or CDA, is used to confirm or refute existing hypotheses. In the medical field, CDA is used in various sectors, from identifying the origin of a specific disease to determining which common drugs are most useful in the treatment of current symptoms. This is generally the way new drugs are developed, where research over time uses a combination of products and drugs to test and improve treatments. After years of research and combined data, the researcher can perform a complete analysis of the data to determine whether the medical combination can cure the disease (Martinez et al., 2010).

Effective health analysis requires much more than extracting information from a database, applying a statistical model, and passing results to various end users. The process of transforming the data acquired in the source systems into information used by the health organization to improve quality and performance requires specific knowledge, adequate tools, quality improvement methodologies, and management commitment. Healthcare transformation efforts require decision makers to use information to understand all aspects of an organization’s performance. In addition to knowing what happened, decision makers now need information on what is likely to happen, what the organization’s improvement priorities should be, and what the expected impacts of the process and other improvements will be. Simply producing reports and visualizations of data from the health data repository is not enough to provide the information decision makers need (Reddy and Aggarwal, 2015).

Data Analytics can help decision makers achieve understanding of quality and operational performance by transforming the way information is used and decisions are made across the organization. Data Analytics is the system of tools and techniques necessary to generate understanding of data. The effective use of analysis within an organization requires that the necessary tools, methods, and systems have been applied appropriately and consistently and that the information generated by the analysis is accurate, validated, and reliable (Strome and Liefer, 2013).

3: Machine learning

Recently, a new tool for knowledge extraction has appeared in the panorama of Data Analytics: Machine Learning, a class of algorithms that, using optimization techniques, manage to retrieve useful information from data automatically. Machine Learning is a branch of Artificial Intelligence that includes all the studies on algorithms capable of performing a task with better performance as experience grows. What makes this field extremely innovative is that it makes the machine an entity capable of carrying out inductive reasoning based on experience, in a way completely analogous to how humans do (Alpaydin, 2020).

Therefore, based on a training set of data from a certain probability distribution, the machine must be able to deal with new problems, not known a priori, by building a probabilistic model of the occurrence space (Ciaburro, 2020). The computational analysis of machine learning algorithms is carried out within learning theory, a field of study that approaches these problems in various ways without ever being able to offer certainty about the results, both because of the finite quantity of training data and because of possible underfitting and overfitting problems, essentially due to a disproportion between the number of required parameters and the number of observations (Marsland, 2015).

With the progress of studies and with the recognition of the various facets of the problems faced, the Machine Learning field has been subdivided into various branches that differ in the approach to solving the problems faced, in the type of data processed, and in the task to be performed by the algorithm (Ciaburro and Venkateswaran, 2017).

Several paradigms have been developed, based on which to classify this type of algorithm:

  •  Supervised learning: The model is trained on input data paired with known outputs, so as to formulate a general rule that correctly associates inputs with outputs.
  •  Unsupervised learning: The model takes unlabeled data as input and tries to find a structure common to these input data.
  •  Reinforcement learning: The model interacts with a dynamic environment and is notified or rewarded only when the goal is accomplished.

In this field there are numerous approaches, based both on the type of strategies adopted and on the models generated: decision trees, graphs that allow decision-making through paths that lead to the prediction of a given variable by classification; genetic algorithms, which emulate the phenomenon of natural selection and genetic evolution through techniques such as mutation and crossover; inductive logic programming, an approach that links propositional logic to symbolic learning and makes extensive use of entailment starting from knowledge bases; and Bayesian networks, graphical representations of the dependency relationships between the variables of a system that provide a compact specification of the full joint probability distribution. Among the algorithms based on Machine Learning, ensemble learning has proved particularly effective in supervised classification (Ciaburro, 2017).
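As a minimal illustration of the supervised paradigm, the following sketch trains a decision tree on a labeled toy dataset using scikit-learn; the dataset and settings are purely illustrative and are not those of the case study treated later:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Labeled examples: inputs X with known outputs y.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Supervised learning: formulate a general input-output rule from the data.
model = DecisionTreeClassifier(random_state=0)
model.fit(X_train, y_train)
print("Accuracy on unseen inputs:", model.score(X_test, y_test))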

4: Approaching ensemble learning

Artificial intelligence studies the reproducibility of complex mental processes through the use of computers and pays particular attention to how machines perceive the external environment, how they interact with it, and how they are able to learn, solve problems, elaborate information, and reach decisions. Of great importance is the learning activity, which allows the machine's knowledge to be increased and adaptive changes to be made, so that the decision-making process in a later period is more efficient (Polikar, 2012).

Therefore, the realization of inductive as well as deductive learning is important, that is, a process that, starting from a collection of examples concerning a specific sphere of interest, arrives at the formulation of a hypothesis capable of predicting future examples. The conceptual difficulty in formulating a hypothesis lies in the impossibility of establishing whether or not it is a good approximation of the function it must emulate; however, qualitative considerations can be drawn: a hypothesis that generalizes well will be able to correctly predict examples it has not seen so far. Furthermore, it is possible to arrive at the formulation of several consistent hypotheses with similar predictive capacity. In this case, the optimal choice is the simplest solution (Zhang and Ma, 2012).

A substantial difference between deductive and inductive inference is that the former guarantees logical correctness, as it does not alter the reality of interest, while the latter can tend toward excessive generalization and therefore carry out incorrect selective processes. Artificial intelligence systems find application in a vast and heterogeneous range of sectors, from the interpretation of natural language, to games, to the demonstration of mathematical theorems, to robotics. Furthermore, machine learning has great relevance in Data Mining, that is, the extraction of data and knowledge from an enormous wealth of information through automated methods. To reproduce typical activities of the human brain, such as the recognition of shapes and figures, the interpretation of language, and the sensorial perception of the environment, neural networks have been created, with the aim of simulating the functioning of the animal brain on the computer (Krawczyk et al., 2017).

Ensemble learning methods are very powerful techniques for obtaining more correct decision-making processes, at the expense of greater complexity and a loss of interpretability, compared to learning systems based on single hypotheses. Ensemble learning combines the predictions of collections of hypotheses to obtain greater performance efficiency. For explanatory purposes, it is useful to compare this type of learning to an executive committee of a company, in which several people with certain skills present their ideas to reach a final decision. It is evident that the knowledge of a group of people can lead to a more thoughtful solution than the decision of a single director. Moreover, this type of analogy allows us to make qualitative considerations that will be applicable in the domain of artificial intelligence (Gomes et al., 2017).

In fact, it is easy to imagine that if the directors all have very similar knowledge, limited to the same area, the analyses and decisions of the individuals will not be very heterogeneous, and the ensemble will bring fewer benefits. If, on the other hand, the knowledge of each director is highly sectored, so that each participant on the committee adds knowledge that is not replicated, the final decision will be more appropriate. If we assume that there is a director at the head of the board who, having taken the individual opinions (votes) into consideration, arrives at a solution, it is reasonable to think that he will trust most those who made more correct choices in the past, in practice giving more credit to the advisors he deems most reliable (Wang et al., 2016).

Similarly, in committee learning, the votes of the individual hypotheses are weighted to emphasize the predictions of those deemed most correct. In conclusion, ensemble learning combines different models derived from the same training set to produce a combination of them that widens the space of hypotheses. The term ensemble means a set of basic learning machines whose predictions are combined to improve overall performance (Akyuz et al., 2017).

Ensemble methods can be divided into two categories: generative and nongenerative. The nongenerative ones try to combine the predictions made by the machines in the best possible way, while the generative ones generate new sets of learners, to generate differences among them that can improve overall performance.

As for nongenerative techniques, in classification, for example, the technique of majority voting is used, possibly refined by weighting the votes proposed by the machines. Predictions can be combined by means of a (possibly weighted) average, median, product, or sum, or by choosing the minimum or the maximum. Generative methods attempt to improve system performance by exploiting the diversity between learners. To do this, different sets with which to train the machines are generated using resampling techniques, the representation of the data is manipulated through feature selection, or learners can be trained that specialize on specific parts of the learning set using mixture-of-experts techniques. Resampling techniques, such as bootstrapping, allow new sets to be generated from the original one (Wang et al., 2018).
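The nongenerative combination rules just described can be sketched in a few lines of NumPy; the predictions and weights below are hypothetical, serving only to show majority and weighted voting:

import numpy as np

# Hypothetical class predictions (rows: 3 classifiers, columns: 5 samples).
votes = np.array([[0, 1, 1, 2, 0],
                  [0, 1, 2, 2, 0],
                  [1, 1, 1, 2, 2]])

# Majority voting: the most frequent class wins for each sample.
majority = np.apply_along_axis(lambda v: np.bincount(v, minlength=3).argmax(),
                               0, votes)

# Weighted voting: each classifier votes in proportion to its weight.
weights = np.array([0.5, 0.3, 0.2])
scores = np.zeros((3, votes.shape[1]))          # one row per class
for classifier, w in enumerate(weights):
    for sample, c in enumerate(votes[classifier]):
        scores[c, sample] += w
weighted = scores.argmax(axis=0)

print("Majority vote:", majority)
print("Weighted vote:", weighted)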

The most common and efficient ensemble learning methods are:

  •  bagging
  •  boosting
  •  stacking

However much ensembles can improve the actual performance of the system, a greater difficulty of analysis must be taken into account: since they can be composed of a substantial number of single models, it is difficult to understand intuitively which factors contributed to the improvement and which ones can instead be considered redundant or even degrading.

For example, several decision trees generated on the same dataset could be considered, with each of them voting on the classification of new data. The final predictive capacity will easily be more correct than that of the individual models, and it could be significantly affected even by a subset of the trees involved in the choice, but it will be difficult to recognize this subset intuitively (Ciaburro and Iannace, 2020).

The simplest technique for combining the responses of individual classifiers into a single prediction is that of weighted votes and is used by both boosting and bagging, the substantial difference of which is in the way in which these models are generated (Alam et al., 2019).

5: Understanding bagging

Bagging is an ensemble method that takes its name from the contraction of Bootstrap AGGregatING and provides for the assignment of identical weights to each individual model applied to a specific training set. It may be thought that if several training sets of equal cardinality are extracted starting from the same problem domain and a decision tree is built for each one, then these trees will be similar and will arrive at identical predictions for a new test example. Bootstrapping consists of the extraction with replacement of the elements of the original set to create new training sets that are different from each other. In bagging, the probability of extraction of each example is equal to that of the others. The basic algorithm involves the creation of a model for each training set and subsequently the combination of the various predictions on the test set through an averaging operation (Baskin et al., 2017).

This assumption is usually incorrect due to the instability of decision trees, in which small changes in the input attributes can correspond to large changes in the ramifications and therefore lead to different classifications. It is implicit that if, starting from the same basic set, one obtains remarkably different results, then the outputs can be both correct and wrong. In a system where there is a hypothesis for each training set and whose response to a new example is determined by the votes of the individual hypotheses (majority vote), a correct prediction will be more likely than the one we would obtain starting from a single model. Nonetheless, it will always be possible to find an incorrect answer, as no learning scheme is immune to error. An estimate of the expected error of the assumed architecture can be calculated as the average of the errors of the individual classifiers. To better understand the characteristics of bagging, it is useful to first analyze bias errors, variance errors, and the bootstrap (Yaman and Subasi, 2019).

Given the following initial dataset:

C = \{(x_1, y_1), \ldots, (x_n, y_n)\}  (1)

A few new datasets C_k (with k = 1, …, m) are extracted from the set C by sampling with replacement. For each C_k dataset obtained, a predictive model is built, according to the following function:

f(x, C) = \frac{1}{m} \sum_{k=1}^{m} f_k(x, C_k)  (2)

The algorithm brings improvements thanks to the diversity of the various f_k models (Fig. 1). For this reason, less stable basic models are recommended, that is, models capable of producing substantial differences even when starting from similar training sets. This does not mean that the basic machines must necessarily be different. Bagging improves overall prediction performance because it reduces the variance if the machines that are part of it have low bias (Subasi and Qaisar, 2020).

Fig. 1 Bagging operating scheme.
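A minimal sketch of Eqs. (1) and (2), assuming NumPy and scikit-learn on illustrative regression data: each C_k is extracted from C with replacement, a model f_k is fitted on it, and the bagged prediction is the average of the individual predictions:

import numpy as np
from sklearn.datasets import make_regression
from sklearn.tree import DecisionTreeRegressor

X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=0)
rng = np.random.default_rng(0)

# Draw m bootstrap sets C_k from C (extraction with replacement, Eq. 1)
# and fit one model f_k on each of them.
m = 25
models = []
for _ in range(m):
    idx = rng.choice(len(X), size=len(X), replace=True)
    models.append(DecisionTreeRegressor().fit(X[idx], y[idx]))

# Bagged prediction f(x, C): the average of the f_k predictions (Eq. 2).
f = np.mean([model.predict(X) for model in models], axis=0)
print("First five bagged predictions:", np.round(f[:5], 2))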

Recall that in an algorithm based on machine learning we can identify two error components: the first, due to the particular learning algorithm used, called bias, and the second, relating to the particular training set used, indicated with the name of variance. The bias represents a deviation from the true value and cannot be calculated exactly, only approximated; it is also independent of the number of training sets used and is an indicator of the persistence of the error of a specific algorithm.

The variance, on the other hand, is closely related to the training set used and is a measure of the variability of the learning model. In practical terms, if we use different training sets to repeat the training several times, the variance will be the difference between the values predicted by each model.
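The variance component can be observed empirically by retraining the same learner on different bootstrap samples and measuring how much its predictions on fixed points disagree; the following is a rough sketch under illustrative data and learner choices:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=300, random_state=0)
X_eval = X[:50]                     # fixed evaluation points
rng = np.random.default_rng(0)

# Retrain the same (unpruned, high-variance) learner on 20 bootstrap
# samples and record its predictions on the fixed evaluation points.
preds = []
for _ in range(20):
    idx = rng.choice(len(X), size=len(X), replace=True)
    preds.append(DecisionTreeClassifier().fit(X[idx], y[idx]).predict(X_eval))
preds = np.array(preds)

# For each point, the share of models deviating from the majority
# prediction is an empirical proxy for the variance component.
agreement = np.apply_along_axis(lambda c: np.bincount(c).max() / len(c),
                                0, preds)
print("Variance proxy:", 1 - agreement.mean())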

The choice of the type of architecture to be used in bagging is based on the evaluation of the best performance obtained on the test set: an evaluation of the error of the basic machines in terms of bias and variance can be carried out. Averaging low-bias models allows a low-bias model to be obtained while reducing the variance. Even when the bias-variance analysis of the machines has not been addressed, the choice of the model with better performance on the test set proves to be consistent (Ditzler et al., 2017).

This choice can be further criticized by observing that the best machines are optimal for the original training set, and it is not known whether they remain suitable for the various sets generated by random extraction. The hope is that the best ones have a complexity suited to the characteristics of the data to be approximated.

As for the dimensions of the Ck sets, there is no theory that indicates what the optimal value should be. This depends on the quantity and type of data available. Bagging is a parallelizable algorithm since the training of a single machine does not influence in any way that of the others.

This bootstrap aggregation technique tends to neutralize the instability of the learning algorithms and is more useful precisely in learning schemes where there is a high instability, as this implies a greater diversity obtainable with small input variations. Consequently, when bagging is used, attempts are made to make the learning algorithm as unstable as possible. The operation of this method can be summarized in two steps:

  1. the generation of individual models, in which m instances are selected from the training set and the learning algorithm is applied
  2. the combination of the models obtained to produce a classification
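A minimal sketch of these two steps, assuming scikit-learn's BaggingClassifier on illustrative data (in scikit-learn versions before 1.2 the parameter is named base_estimator rather than estimator):

from sklearn.datasets import make_classification
from sklearn.ensemble import BaggingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Step 1: each tree is trained on instances drawn with replacement;
# step 2: the trees are combined by vote to produce the classification.
bagging = BaggingClassifier(
    estimator=DecisionTreeClassifier(),   # an unstable base learner
    n_estimators=50,
    bootstrap=True,                       # sampling with replacement
    n_jobs=-1,                            # the trainings are independent
    random_state=0,
)
bagging.fit(X_train, y_train)
print("Test accuracy:", bagging.score(X_test, y_test))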

The error due to the bias turns out to be the mean square deviation obtained by averaging the numerical classifications of the individual models, while the variance is the distance between the predictions of the individual models and depends on the training set used. Bias and variance are closely related to the complexity of the learning model, so as the parameters added to it increase, the bias component will decay while the variance component will increase exponentially (Singhal et al., 2018).

6: Exploring boosting

The basic idea of boosting techniques is to build a list of classifiers by assigning, in an iterative way, a weight to each new classifier that reflects its ability to recognize samples not correctly identified by the classifiers already involved in the training. At each phase of the algorithm, a new classifier is trained on the dataset, in which the weighting coefficients are adjusted based on the performance of the previously trained classifier, so as to assign a greater weight to the incorrectly classified data points (Liu et al., 2018).

The algorithm focuses on the samples that are most difficult to classify, which are therefore weighted more. The final classifier is obtained with a weighted vote of the built models. As with other ensemble techniques, combining multiple models is particularly effective when they achieve a high percentage of correct predictions and are quite different from each other, i.e., present a high rate of variability. The ideal situation at which boosting aims is the maximum sectorization of the models, so that each of them is a specialist in a part of the domain where the other classifiers fail to arrive at accurate predictions. Therefore, boosting attributes greater weight to instances that have not been correctly predicted, so as to build, in subsequent iterations, models capable of filling this gap. In analogy with bagging, only learning algorithms of the same type are combined, and their outputs are combined by vote or by averaging the individual responses, in the case of classification or numerical prediction, respectively (Ghojogh and Crowley, 2019).

The algorithm consists of the following steps:

  1. A tree is produced with a process that considers instances with greater weight to be more relevant
  2. The produced tree is used to classify the training set
  3. The weight of correctly classified instances is reduced, while that of incorrectly classified instances is increased
  4. Steps 1 to 3 are repeated until a specified number of trees has been produced
  5. A weight proportional to its performance on the training set is assigned to each tree.
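A compact sketch of this loop for binary labels coded as -1/+1, following the classic AdaBoost weight-update rule; the weak learner and the number of rounds are illustrative:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    """Boosting loop sketch; y must contain labels in {-1, +1}."""
    w = np.full(len(X), 1.0 / len(X))    # equal initial instance weights
    trees, alphas = [], []
    for _ in range(n_rounds):
        tree = DecisionTreeClassifier(max_depth=1)      # a weak learner
        tree.fit(X, y, sample_weight=w)                 # step 1
        pred = tree.predict(X)                          # step 2
        err = max(w[pred != y].sum() / w.sum(), 1e-10)
        if err >= 0.5:                   # no better than random guessing
            break
        alpha = 0.5 * np.log((1 - err) / err)           # step 5
        w *= np.exp(-alpha * y * pred)   # step 3: reweight the instances
        w /= w.sum()
        trees.append(tree)
        alphas.append(alpha)             # step 4: repeat until done
    return trees, alphas

def predict(trees, alphas, X):
    # Classification of new instances by weighted vote of all the trees.
    return np.sign(sum(a * t.predict(X) for t, a in zip(trees, alphas)))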

For the classification of new instances, a weighted voting system, usually majority voting, by all the trees is used. The purpose of this algorithm is to produce different trees, so as to cover a wider set of types of instances. The defect of which the boosting method is often accused is susceptibility to noise. If there are incorrect data in the training set, the boosting algorithm will tend to give greater weight to the instances that contain them, inevitably leading to a deterioration in performance. To overcome this problem, boosting algorithms have also been proposed in which instances that are repeatedly classified incorrectly are interpreted as containing incorrect data, and consequently their weight is reduced (Ng et al., 2018).

7: Discovering stacking

Stacked generalization, from whose abbreviation the term stacking derives, is the most recent ensemble technique, conceived to generate a scheme that minimizes the error rate of classifiers. Stacking, by virtue of its functioning, can be considered as a process that extends the behavior of cross-validation to combine different individual models more efficiently. Stacking is a technique used to obtain high precision in generalization. This method tries to evaluate the reliability of the trees produced and is usually used to combine trees produced by different algorithms. The idea is to extract a new dataset containing an instance for each instance of the original dataset, in which, however, the original attributes are replaced with the classifications produced by each tree, while the output remains the original class. These new instances are then used to produce a new classifier that combines the different predictions into one. It is suggested to divide the original training set into two subsets: the first used to produce the base classifiers and the second used to create the new dataset (Divina et al., 2018).

The new classifier's predictions will therefore reflect the actual performance of the base induction algorithms. When classifying a new instance, each base tree produces its prediction. These predictions constitute a new instance, which will be classified by the new classifier. If the trees produce a probability classification, it is possible to increase the stacking performance by inserting into the new instances the probabilities expressed by each tree for each class. Stacking performance has been shown to be at least comparable to that of the best classifier chosen by cross-validation (Sun and Trevor, 2018).
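A minimal sketch of stacked generalization, assuming scikit-learn's StackingClassifier on illustrative data; cross-validation builds the meta-level instances, and passing class probabilities upward follows the suggestion above:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=500, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Base classifiers provide the predictions that form the new dataset;
# the final estimator learns to combine those predictions into one.
stack = StackingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("svm", SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(),
    cv=5,                             # cross-validated meta-level data
    stack_method="predict_proba",     # pass class probabilities upward
)
stack.fit(X_train, y_train)
print("Test accuracy:", stack.score(X_test, y_test))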

7.1: Machine learning applications for healthcare analytics

In recent years, algorithms based on Machine Learning have been applied to extract knowledge in many fields. In HealthCare, these algorithms have been used to improve the quality of life by supporting researchers in activities aimed at diagnosing diseases, analyzing clinical data, in the process of drug discovery, to name a few (Panesar, 2019).

7.2: Machine learning-based model for disease diagnosis

In the last 10 years, there has been a significant and growing use of Machine Learning in HealthCare, which is gaining great interest thanks also to publications that have revealed high precision in specific clinical contexts. Some algorithms have managed to show a diagnostic accuracy comparable to that of experienced doctors in different disciplines. Examples of machine learning applications in HealthCare that have brought a benefit in diagnostic accuracy include the detection of cancers through the analysis of radiological images, the detection of diabetic retinopathy, and algorithms capable of predicting future cardiovascular events (Rojas et al., 2019).

7.3: Machine learning-based algorithms to identify breast cancer

Cancer is a very complex genetic disease, the appearance of which is attributable to certain unwanted genetic mutations. Knowing how to recognize these genetic mutations could be useful for the identification and prevention of tumor development in individuals. Even more useful would be the ability to identify tumor development from a limited number of mutations. Fine-needle aspiration (FNA) of a breast lump allows a sample of cells to be taken and studied under the microscope to discriminate whether the lump is of a benign nature or is a malignant tumor. A thin needle is inserted into the breast until it reaches the lump, from where a part of the content to be examined in the laboratory is aspirated. The sampling is done simultaneously with an ultrasound scan to locate the nodule. The aspirated biological fluid is subsequently submitted for cytological examination, which consists in the observation under an optical microscope of the cells taken, characterizing their content through the extraction of features. Agarap (2018) applied six machine learning-based algorithms to identify breast cancer based on features extracted from FNA tests. The dataset used by Agarap for training the algorithms contains the features obtained through the cytological examination. Six algorithms were used in this study: Linear Regression, Multilayer Perceptron (MLP), Nearest Neighbor (NN), Softmax Regression, Support Vector Machine (SVM), and finally a combination of a recurrent neural network (RNN) and the SVM. All the algorithms used returned satisfactory results, with high performance on the binary classification of breast cancer and an accuracy that exceeded 90%.

7.4: Convolutional neural networks to detect cancer cells in brain images

Sawant et al. (2018) used an algorithm based on convolutional neural networks (CNNs) to detect cancer cells in brain images obtained with the magnetic resonance imaging (MRI) test. MRI is a test carried out with a machine that nowadays plays an important role in the health sector and allows a whole range of diagnostic examinations to be performed, from traditional to functional neuroradiology, from internal diagnostics to obstetrics, gynecology, and pediatrics. The machine is equipped with a magnet that creates a strong and stable magnetic field to align all the protons in the same field and make them spin in the same direction. A radio-frequency signal is subsequently sent within the magnetic field, which causes the misalignment of the protons; when the signal ends, the protons return to their equilibrium position and release energy, which is detected by a receiving coil. From the detection, the time it takes for the protons to return to their aligned position is measured, providing information regarding the type of tissue and reconstructing the image of the site in question. In this study, the authors used the Tensorflow platform to develop the CNN-based algorithm, obtaining 98.6% accuracy on validation data.

7.5: Machine learning techniques to detect prostate cancer in Magnetic resonance imaging

Prostate cancer originates from the cells of the prostate gland. Many forms of prostate cancer grow slowly and are unlikely to spread, but some can proliferate faster. The precise causes of prostate cancer are unknown, and early-stage prostate cancer is typically asymptomatic. Prostate cancer is the second most common malignant neoplasm in men worldwide and mainly affects men of advanced age: over half of prostate cancer cases are diagnosed in men over the age of 70. Symptoms that can manifest themselves as the disease progresses are often caused by the compression exerted by the cancerous mass on the urethra and include an increase in urinary frequency, difficulty urinating, or an urgent need to urinate. Normally, the diagnosis of prostate cancer is based on the results of the clinical examination of the prostate, a blood test that measures the levels of a protein called prostate-specific antigen (PSA), and a biopsy. Further investigations may help determine how advanced the cancer is; for example, imaging tests such as magnetic resonance imaging (MRI) may be used. Prostate cancer is staged based on the size of the tumor, the presence or absence of involvement of the lymph nodes, and the presence or absence of spread to other parts of the body. This information is used to facilitate the choice of the optimal therapeutic strategy. The images obtained with the MRI test have a high resolution and require adequate diagnostic tools for a correct interpretation. Hussain et al. (2018) applied several Machine Learning techniques to detect prostate cancer in images obtained from the MRI test. To support the radiologist in detecting anomalies, the authors developed algorithms based on the Bayesian approach, on the Support Vector Machine (SVM) with radial basis function (RBF) and Gaussian kernels, and on the Decision Tree. The algorithms were trained using, among others, scale-invariant feature transform (SIFT) and elliptic Fourier descriptor (EFD) features. The results obtained in the automatic diagnosis of prostate cancer were satisfactory, reaching an accuracy of 98.3% in the case of the SVM with Gaussian kernel.

7.6: Classification of respiratory diseases using machine learning

Respiratory diseases are diseases that affect the lungs and/or respiratory tract. These include asthma, chronic obstructive pulmonary disease (COPD), allergic rhinitis, work-related lung disease, and pulmonary hypertension. To diagnose these diseases, the doctor carries out a thorough physical examination and an interview, during which the characteristics of the disease and the respiratory problems are examined and evaluated. A typical examination performed by the doctor is auscultation, a diagnostic technique that consists of listening, through a stethoscope, to the internal sounds of the body; in this case, the sounds of interest are those produced by the respiratory system. The stethoscope can be placed on various parts of the chest, back, and neck. The normal sounds that can be heard on a subject's chest during the inhalation phase are mainly generated in the lobar part of the respiratory tract. Respiratory sounds are generated by air turbulence in the airway. The characteristics of the sounds are very variable: differences can be noticed from person to person that depend on weight, age, health, and other factors, and there is also variability with respect to the density of the gas breathed. Abnormal or pathological respiratory sounds are accidental sounds that are not part of the normal breathing cycle. In recent years, the classic stethoscope has been replaced by an electronic version that records the sounds from the lungs. Poreva et al. (2017) studied lung sounds to classify them automatically using algorithms based on Machine Learning. The authors used the sounds recorded from the patients, extracting bicoherence coefficients that were used to train the algorithms. Five algorithms were tested: Support Vector Machine, Logistic Regression, Bayes Classifier, k-Nearest Neighbors, and Decision Tree. The results showed that the SVM classifier and the decision tree classifier returned the best performances, with 88% and 77% accuracy, respectively.

7.7: Parkinson’s disease diagnosis with machine learning-based models

Parkinson’s disease is a degenerative disease of the central nervous system that affects 7 to 10 million people worldwide, with an average age of onset around 60 years. The prevalence of this disease is expected to double over the next 20 years, mainly due to the growing aging of the population. This pathology manifests itself with various symptoms, motor or nonmotor, that can lead to various problems reflected in daily life. Quantitatively, the number of symptoms is very high, and it is difficult to consider all the subjective signs and symptoms. As a result, research often focuses on a certain aspect, leaving out or in any case not considering everything else, thus losing the overall vision of the subject. A motor symptom of this disease is difficulty with language: the voice may be feebler, or it may present a loss of tonality and modulation, which leads the patient to speak in a rather monotonous way. Sometimes a palilalia appears, which manifests itself in a repetition of syllables, and there is a tendency to accelerate the emission of sounds and not to pronounce all the syllables. Mostafa et al. (2019) developed a methodology for the identification of Parkinson’s disease by classifying voice disorders using algorithms based on Machine Learning. The authors first extracted features from the dataset of recorded vocal sounds and then filtered these features through a Multiple Feature Evaluation Approach (MFEA) with a multiagent system. They later used several Machine Learning-based algorithms to classify voice disorders: Decision Tree, Naïve Bayes, Neural Network, Random Forest, and Support Vector Machine. The models returned an accuracy ranging from 74% to 86%.

8: Processing drug discovery with machine learning

Drug research and development is an expensive, long, and inefficient process: it takes more than 10 years to bring a new drug to the market, with a cost of several billion dollars and a high risk of failure. The need to overcome the limits in the development of new therapies is made even more evident in light of data on global health needs. Half of the failures are due to lack of efficacy, while a quarter are due to problems of tolerability, both causes expressing the difficulty of selecting the right target for the disease under study (Stephenson et al., 2019). To optimize the process of discovery and development of new therapeutic molecules, it is therefore necessary to use the knowledge hidden in the complexity of the data made available by biomedical research in the most efficient way. Thanks to the computational methodologies of computer science and to machine learning techniques, it is possible to manage and analyze the volume of biomedical data in an automated way, to extrapolate significant relationships, to generate new hypotheses to be subjected to experimental verification, and to predict with statistical methods the occurrence of future phenomena, including the efficacy and toxicity associated with drugs (Vamathevan et al., 2019).

The search for new drugs is long and complex. First, some molecule candidates for therapy are identified and tested on cell cultures. Those that are most effective go to the next stages, in vivo tests. Finally, clinical trials are conducted on human patients. This process often requires years of research and significant economic investments. In recent times, the development of technology is revolutionizing the process of finding new drugs (Ekins et al., 2019).

In most cases, the targets of drug action are functional proteins such as receptors, enzymes, transport proteins, and a cascade of intracellular events derived from the drug-substrate interaction that culminates in the final biological effect. In a smaller number of cases, the methods of interaction between drug and living matter are carried out differently without affecting macromolecular complexes. Finally, some categories of drugs interact directly with DNA (Klambauer et al., 2019).

Generally, the birth of a drug starts precisely from the identification of a pharmacological target in the context of a clinical condition of interest. Once a project has been planned for the creation of a new drug starting from a real therapeutic need, researchers must focus their attention on a pharmacological target to treat a pathological condition. Thus begins the process of developing a drug, a long journey in stages that requires the use of huge human and economic resources (Xiao and Sun, 2019).

In this long process, the use of artificial intelligence can play a crucial role in speeding things up by identifying the essential characteristics and leaving out the superfluous ones. Machine learning can be of assistance in many activities used in the modeling of new drugs. Identification of protein sequences, virtual screening, prediction of bioactivity, chemical synthesis, and prediction of toxicity are just some examples of activities that can be tackled with the help of algorithms based on Machine Learning (Zoffmann et al., 2019).

8.1: Analyzing clinical data using machine learning algorithms

Electronic devices connected to an intensive care unit (ICU) patient produce a large amount of data on the patient's status. These data are often used only to trigger alarms in the event of an emergency, alerting medical personnel who then go to the patient to check on their health. Today, thanks to algorithms based on Machine Learning, these data could be used to extract knowledge about the patient's clinical path and help medical staff predict the evolution of the disease. The historical data of patients who have undergone a similar course of the disease could be used to train a model capable of predicting the future situation of a patient in hospital. A Machine Learning-based model can be developed to predict the mortality of ICU patients using the diagnostic codes of the Medical Information Mart for Intensive Care (MIMIC) database (Jaworska et al., 2019).

This is only one application well suited to the use of patient clinical data to generate predictive scenarios that support medical personnel in the decision-making process related to the therapy to be given to a patient. Another possible application is the use of wearable sensors to manage a patient's rehabilitation path: the collected sequential data can be used to develop a model, based on a recurrent neural network, that recognizes the patient's behavior (Khan et al., 2020).

A further source of data is represented by Electronic Health Records (EHRs). The EHR is a tool used to collect a patient's medical history, data that are gathered during meetings with healthcare professionals, for prevention or on episodes of illness. The data present in it, suitably reworked, subsequently constitute a source of historical data useful for the management of the health system, alongside the more strictly administrative and organizational data. These data can be used to develop a model that predicts a patient's future health based on past EHR data (Wang et al., 2020).

8.2: Predicting thyroid disease using ensemble learning

The thyroid is an endocrine gland that synthesizes and secretes two hormones, thyroxine (T4) and triiodothyronine (T3), which control numerous metabolic functions, act on the development of the central nervous system, and allow the organism to grow. The thyroid consists of two lobes connected by an isthmus and is located in the neck between the second and third tracheal rings, inferior to the thyroid cartilage. The synthesis of thyroid hormones is divided into three phases (Cooper and Biondi, 2012).

  •  Iodine uptake. Follicular thyroid cell-mediated iodine uptake is the first step in the synthesis of thyroid hormones.
  •  Synthesis of thyroglobulin. The thyroglobulin synthesized in the endoplasmic reticulum is transferred to the Golgi apparatus where it is glycosylated, then stored in exocytotic vesicles and released into the cavity of the follicle.
  •  Iodide organification and iodotyrosine condensation. This process involves numerous reactions catalyzed by thyroid peroxidase (TPO). TPO is an enzyme containing a heme group, thanks to which it oxidizes the iodine collected by the follicular cells.

Synthesis and secretion of thyroid hormones are regulated by extrathyroid factors that act on these processes according to a negative-feedback mechanism, and by intrathyroid self-regulation factors dependent on iodine intake.

Thyroid diseases include benign and functional pathologies, which can be traced back to hypo- and hyperfunctioning forms (depending on the quantity of thyroid hormones produced), as well as inflammatory and neoplastic pathologies.

The term hyperthyroidism refers to a clinical situation characterized by an increase in circulating thyroid hormones T3 and T4. Since thyroid hormones are the main regulators of metabolism, this condition leads to an increase in many metabolic reactions (Vanderpump, 2011).

Hypothyroidism is a clinical syndrome due to an insufficient action of the thyroid hormones at the tissue level, which causes a slowdown of all metabolic processes. Hypothyroidism that develops during fetal and/or neonatal life determines an important and often permanent reduction of growth and neurological development. Hypothyroidism in adults, which has a high frequency (20.6%–20.8%), is more frequent in females and in old age and is, in most cases, a consequence of autoimmune pathology; it causes a generalized slowdown of the metabolic processes (Fatourechi, 2001).

8.3: Machine learning-based applications for thyroid disease classification

The diagnosis of thyroid disorders requires experience and knowledge on the doctor's part. The diagnosis is made through a physical examination of the patient and a simultaneous interview to collect the symptoms the patient feels, subsequently examining blood values from laboratory tests. Despite all these tests, it is not easy to diagnose and predict a thyroid disorder with accuracy.

Shankar et al. (2018) developed an algorithm for classifying thyroid disorders based on a kernel classification process. To start with, they carried out a feature selection process to reduce the number of variables used in the model, so as to reduce the processing time of the algorithm and focus attention only on the most significant predictors. The authors then used a multikernel SVM for the classification of the thyroid disorder. Support vector machines (SVMs) are a set of supervised learning methods that can be used for both classification and regression. Given two classes of multidimensional, linearly separable patterns, among all possible separating hyperplanes, the SVM determines the one capable of separating the classes with the greatest possible margin. In practice, the linearly separable case is rarely encountered, so nonlinear models or a combination of linear and nonlinear models are used. The authors used a combination of a linear kernel and a radial basis function kernel. The results obtained were satisfactory, with an accuracy of 97%.
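The multikernel idea can be sketched generically: scikit-learn's SVC accepts a callable kernel, so a convex combination of a linear and an RBF kernel can be passed directly. The following is an illustration of the technique, not the authors' implementation; the data, weight, and gamma are arbitrary:

import numpy as np
from sklearn.metrics.pairwise import linear_kernel, rbf_kernel
from sklearn.svm import SVC

def combined_kernel(X, Y, weight=0.5, gamma=0.1):
    # Convex combination of a linear kernel and an RBF kernel.
    return (weight * linear_kernel(X, Y)
            + (1 - weight) * rbf_kernel(X, Y, gamma=gamma))

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                  # illustrative data
y = (X[:, 0] + X[:, 1] ** 2 > 1).astype(int)

svm = SVC(kernel=combined_kernel)              # callable custom kernel
svm.fit(X, y)
print("Training accuracy:", svm.score(X, y))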

Tyagi et al. (2018) studied the classification of thyroid disorders using different machine learning-based algorithms: first Artificial Neural Networks, then a model based on the Support Vector Machine, then Decision Trees, and finally the k-Nearest Neighbor algorithm. In Decision Tree-based algorithms, classification functions are learned in the form of a tree where each internal node represents a variable, an arc toward a child node represents a possible value for that property, and a leaf represents the predicted value for a class starting from the values of the other properties, represented in the tree by the path from the root node to the leaf node. The k-Nearest Neighbor algorithm is based on the concept of classifying an unknown sample by considering the classes of the k closest samples of the training set. The new sample is assigned to the class to which most of the k nearest samples belong.

Ma et al. (2019) used single-photon emission computed tomography (SPECT) images to train a model based on convolutional neural networks to diagnose thyroid disorders. SPECT is a recent diagnostic test that allows the reconstruction on the computer of scintigraphic images of the distribution of a radioactive tracer, administered in small doses to the patient, in order to measure certain biological and biochemical processes. Three types of disorders were classified: Graves’ disease, Hashimoto’s disease, and subacute thyroiditis. The authors developed a model based on a DenseNet architecture. A DenseNet network is made up of a series of dense blocks interspersed with pooling layers. A dense block consists of a sequence of convolutional feature maps, without pooling, where the input of a map is the concatenation of the outputs of all the previous maps.

Poudel et al. (2019) used ultrasound images of the thyroid for the classification of its disorders, adopting a methodology based on the combination of machine learning algorithms and autoregressive features. Thyroid diseases change the size and shape of the gland, and these changes become appreciably noticeable with the course of the disease. With thyroid ultrasound it is possible to study the position, shape, structure, and size of this gland, so it is used in the study of chronic thyroid diseases. The analysis of thyroid ultrasound is not simple, as it deals with low-contrast images with a considerable presence of noise and an uneven distribution. To automate the classification process of thyroid ultrasounds, the authors used three machine learning algorithms: Support Vector Machine, Artificial Neural Network, and Random Forest. For the extraction of the features to be used in training the model, a methodology based on autoregressive modeling was adopted, identifying 30 spectral features. The proposed technology returned an accuracy of approximately 90% with all three methods.

Ouyang et al. (2019) classified thyroid nodules through linear and nonlinear machine learning algorithms. Early treatment of thyroid cancer through the analysis of malignant thyroid nodules is crucial in the treatment of thyroid disorders. In this task, algorithms based on machine learning can support the work of clinicians in the classification of nodules based on the information contained in pathological reports or on data obtained with fine-needle aspiration (FNA). The authors compared the results obtained in the classification of nodules using linear and nonlinear algorithms. Among the linear methods, Ridge regression, Lasso-penalty, and Elastic Net (EN) were applied. As nonlinear methods, the authors used random forest (RF), kernel Support Vector Machines (k-SVMs), Neural Network (Nnet), kernel nearest neighbor (k-NN), and Naïve Bayes (NB). The linear and nonlinear methods returned comparable results, and the methods that returned the best performances were Random Forest and kernel Support Vector Machines.

8.4: Preprocessing the dataset

The goal of this work is to identify a thyroid disorder among three a priori defined classes. To do this, we will use the data contained in the dataset called Thyroid Disease Data Set from the Garavan Institute. These data were taken from the UCI Repository of Machine Learning databases (Dua and Graff, 2019). The database contains 7200 instances with 21 predictors, including 15 categorical and 6 real attributes. The response variable contains three classes: 1-normal (nonhypothyroid), 2-hyperthyroid, and 3-hypothyroid. The variables contained in the dataset are the following:

  1. Age: real
  2. Sex: categorical
  3. On_thyroxine: categorical
  4. Query_on_thyroxine: categorical
  5. On_antithyroid_medication: categorical
  6. Sick: categorical
  7. Pregnant: categorical
  8. Thyroid_surgery: categorical
  9. I131_treatment: categorical
  10. Query_hypothyroid: categorical
  11. Query_hyperthyroid: categorical
  12. Lithium: categorical
  13. Goiter: categorical
  14. Tumor: categorical
  15. Hypopituitary: categorical
  16. Psych: categorical
  17. TSH: real
  18. T3: real
  19. TT4: real
  20. T4U: real
  21. FTI: real
  22. Class: categorical
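
As an illustration, the following minimal Python sketch loads the dataset with pandas. The file name ann-thyroid.data, the whitespace-separated layout, and the lower-case column names are assumptions about a local copy of the UCI files, not details given in this chapter.

import pandas as pd

# Column names follow the variable list above (assumed order and spelling).
columns = [
    "age", "sex", "on_thyroxine", "query_on_thyroxine",
    "on_antithyroid_medication", "sick", "pregnant", "thyroid_surgery",
    "i131_treatment", "query_hypothyroid", "query_hyperthyroid", "lithium",
    "goiter", "tumor", "hypopituitary", "psych",
    "tsh", "t3", "tt4", "t4u", "fti", "class",
]

# Hypothetical local file holding the 7200 instances in whitespace-separated form.
data = pd.read_csv("ann-thyroid.data", sep=r"\s+", header=None, names=columns)

X = data.drop(columns="class")  # the 21 predictors
y = data["class"]               # 1 = normal, 2 = hyperthyroid, 3 = hypothyroid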

The ability of an algorithm to perform well on inputs never observed before is called generalization. During the training phase, the performance of the model is evaluated by sending test data as input and calculating the error associated with these data. The algorithm is then improved by trying to minimize this error. Minimizing an error is, in itself, a simple optimization problem; what separates machine learning from pure optimization is that the quantity to be minimized is not the training error but the generalization error. The generalization error is defined as the expected value of the error on a new input, where the expectation is taken over the distribution of inputs that we expect the system to meet in practice (Kawaguchi et al., 2019).
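
In symbols, this definition can be sketched as follows, where f is the trained model, L a generic loss function, and p the data-generating distribution (all notation our own, not taken from this chapter):

E_{\mathrm{gen}}(f) = \mathbb{E}_{(x,y)\sim p}\left[L\big(f(x),\, y\big)\right]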

To guarantee the generalization of the algorithm, it is necessary to divide the available data appropriately into two subsets: a training set and a test set. The training set is used for training the algorithm, while the test set is used in the test phase. In this way, the trained algorithm is tested on data it has never seen before (Tabassum, 2020).

There are several techniques for splitting the dataset into training and test sets. The most used are simple random sampling (SRS), systematic sampling, trial-and-error methods, convenience sampling, and stratified sampling. In this work, we used the simple random sampling (SRS) method (Reitermanova, 2010).

SRS is the most used method, given its efficiency and simplicity of implementation. With this method, the samples are selected randomly with a uniform distribution: each sample has the same probability of being selected. Random selection avoids systematically grouping samples with similar characteristics into the same subset. In this study, 70% of the data (5040 samples) was used for the training set and the remaining 30% (2160 samples) for the test set.
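
As a concrete sketch, the 70/30 simple random split can be reproduced with scikit-learn; the random seed is an arbitrary choice of ours, and X and y are the predictor matrix and class vector loaded earlier.

from sklearn.model_selection import train_test_split

# Simple random sampling: every instance has the same probability of
# ending up in either subset. 30% of 7200 instances gives 2160 test samples.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
                                                    random_state=1)
print(len(X_train), len(X_test))  # 5040 and 2160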

The hardware and software described in Tables 1 and 2 were used to tackle the problem of predicting thyroid disorders with the use of ensemble methods.

Table 1

Hardware requirements for the simulation.
Central Processing Unit (CPU): Intel Core i5 6th generation or higher, or an equivalent AMD processor
RAM: 8 GB minimum; 16 GB or higher recommended
Graphics Processing Unit (GPU): NVIDIA GeForce GTX 960 or higher
Operating System: Ubuntu or Microsoft Windows 10

Table 2

Software requirements for the simulation.
Programming Platform: Python, R
Libraries: TensorFlow, Scikit-Learn, Keras

8.5: AdaBoostM algorithm

The AdaBoostM algorithm is one of the most used variants of boosting. This method is indicated when the output is discrete, that is, when we want to solve a classification problem. The algorithm sequentially trains individual models, encouraging them at each iteration to provide correct predictions for the most important instances, that is, those to which the greatest weight is attributed.

AdaBoostM initially assigns the same weight to each instance of the training set; then, at each iteration, the chosen learning algorithm generates a classifier and new weights are attributed: the weights of the instances classified correctly are decreased, while those of the misclassified instances are increased (Freund and Schapire, 1996).

The AdaBoostM algorithm builds a complex classifier starting from simple classifiers:

f(x) = \sum_{t=1}^{T} w_t\, h_t(x)

Here,

  •  w_t is the weight attributed to the tth weak classifier
  •  h_t(x) is the prediction (the hypothesis) made by the tth weak classifier for the input x

Since at each iteration the learning algorithm redistributes the weights, from one round to the next we obtain sets of instances that can be considered easier, and others more difficult, that is, not yet correctly classified; it is on the latter that the following classifiers are built.

At each iteration t, the weights are updated only for the correctly classified instances, while the weight of the misclassified instances initially remains unchanged. Subsequently, the weights are normalized so that their total sum remains unaltered: the weight of each instance is multiplied by the sum of the old weights and divided by the sum of the new weights. This operation increases the relative importance of the instances not yet correctly classified.
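
As a toy numerical sketch of this step (the five weights, the correctness mask, and the factor beta are invented values for illustration only):

import numpy as np

w = np.full(5, 0.2)                                   # uniform initial weights
correct = np.array([True, True, False, True, False])  # classification outcome
beta = 0.25                                           # illustrative factor in (0, 1)

w[correct] *= beta   # decrease the weights of correctly classified instances
w /= w.sum()         # normalize so that the weights sum to one again
print(w)             # the two misclassified instances now carry more weight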

Once the iterative construction of the individual models is complete, it is necessary to decide how to combine them to obtain a prediction; the individual responses of the classifiers are therefore also weighted, giving greater emphasis to those believed to make better predictions (Burduk et al., 2020).

The estimate that allows us to evaluate the performance of each element of the ensemble is the prediction error: a value close to zero is an indicator of high accuracy.

The algorithm consists of the following steps:

  •  Input: training set (x_i, y_i), with x_i ∈ X and y_i ∈ Y
  •  Initialization: D_1(i) = 1/N, for i = 1, …, N
  •  For t = 1, …, T:
    1. Train the weak classifier h_t using the distribution D_t, minimizing the weighted error:

       \epsilon_t = \sum_i D_t(i)\,[h_t(x_i) \neq y_i]

    2. Set the classifier weight:

       w_t = \ln\frac{1 - \epsilon_t}{\epsilon_t}

    3. Update D using the following equation:

       D_{t+1}(i) = \frac{D_t(i)\, e^{\,w_t\,[h_t(x_i) \neq y_i]}}{Z_t}

       Here, Z_t is a constant chosen so that D_{t+1} remains a distribution (normalization)

  •  Output: set the final classifier H(x) as follows:

     H(x) = \arg\max_{y \in Y} f(x, y) = \arg\max_{y \in Y} \sum_{t=1}^{T} w_t\,[h_t(x) = y]

Here, [·] denotes the indicator function, equal to 1 when its argument is true and 0 otherwise.
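
The following compact NumPy sketch implements the loop above as we have reconstructed it, using one-level decision trees (stumps) from Scikit-Learn as weak classifiers; it is an illustration of the listed steps, not the exact implementation used in this experiment.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def adaboost_m_fit(X, y, T=50):
    """Boosting loop: returns the weak classifiers and their weights w_t."""
    N = len(y)
    D = np.full(N, 1.0 / N)                   # D_1(i) = 1/N
    learners, weights = [], []
    for t in range(T):
        stump = DecisionTreeClassifier(max_depth=1)
        stump.fit(X, y, sample_weight=D)      # train h_t on distribution D_t
        miss = stump.predict(X) != y          # indicator [h_t(x_i) != y_i]
        eps = D[miss].sum()                   # weighted error epsilon_t
        if eps <= 0.0 or eps >= 0.5:          # stop if h_t is perfect or too weak
            break
        w = np.log((1.0 - eps) / eps)         # classifier weight w_t
        D = D * np.exp(w * miss)              # raise weights of misclassified instances
        D = D / D.sum()                       # normalize by Z_t
        learners.append(stump)
        weights.append(w)
    return learners, np.array(weights)

def adaboost_m_predict(learners, weights, X, classes):
    """Weighted vote: H(x) = argmax_y sum_t w_t [h_t(x) = y]."""
    votes = np.zeros((len(X), len(classes)))
    for h, w in zip(learners, weights):
        pred = h.predict(X)
        for k, c in enumerate(classes):
            votes[:, k] += w * (pred == c)
    return np.asarray(classes)[votes.argmax(axis=1)]

A call such as adaboost_m_predict(*adaboost_m_fit(X_train, y_train), X_test, classes=[1, 2, 3]) then produces the ensemble's vote for each test instance.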

To reach a final prediction, the weighted votes in favor of each output class are added together and the class with the highest total is chosen.

After training the algorithm on the training dataset, the resulting model is used to make predictions on the test dataset. Finally, a comparison is made between the predicted and expected values. The results are presented using a confusion matrix. A confusion matrix is a square n × n matrix, with n the number of classes to predict. This matrix tells us how a classifier behaves with respect to the different classes: the numbers of correctly predicted instances of each class are positioned on the main diagonal, so all values outside the main diagonal represent classification errors. Table 3 shows the results of the classification:

Table 3

Confusion matrix.

                     Predicted class
True class      Class 1   Class 2   Class 3
Class 1              53         0         0
Class 2               0        95         2
Class 3               4         0      2006

From the analysis of the confusion matrix we can see that the classifier returned an excellent result on the test dataset. Recall that these data had not been provided to the classifier during the training phase. Out of 2160 observations, only 6 errors were committed, with 2154 correct classifications, corresponding to an accuracy of 99.7%.
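
For reference, the same train-predict-evaluate cycle can be sketched with Scikit-Learn's off-the-shelf boosting classifier; the number of estimators and the seed are illustrative choices of ours, and the variables come from the split sketched earlier.

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

model = AdaBoostClassifier(n_estimators=100, random_state=1)
model.fit(X_train, y_train)              # train on the 5040 training samples

y_pred = model.predict(X_test)           # predict the 2160 test samples
print(confusion_matrix(y_test, y_pred))  # rows: true class, columns: predicted
print(accuracy_score(y_test, y_pred))    # fraction on the main diagonal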

9: Conclusion

In this chapter, we have studied ensemble learning algorithms and how they can be used for the classification of health disorders. Ensemble learning methods provide decision-making processes with superior performance compared to basic methods; this improvement comes at the price of greater complexity and a loss of interpretability with respect to learning systems based on individual hypotheses. Ensemble learning combines the predictions of collections of hypotheses to achieve greater performance efficiency. We initially introduced the topic by analyzing various bibliographic contributions that used machine learning-based models in the healthcare context. Subsequently, we examined the methodologies underlying ensemble learning, presenting different algorithms. Finally, we applied these methods to the classification of thyroid disorders.

Techniques based on ensemble learning achieve classification results superior to those of other algorithms, which recommends their use. On the other hand, the models obtained with these techniques are difficult to interpret, as their output is difficult to explain; this makes such methodologies less popular in the business world. Furthermore, ensemble methods are difficult for practitioners to master, and a wrong choice of components can lead to lower predictive accuracy than an individual model. Finally, the training process is expensive both in terms of computation time and memory space. These weaknesses represent starting points for improving technologies based on ensemble learning, and they pose challenges to the scientific community to spread the use of these technologies more widely in the world of work.
