In mathematics and computer science, an algorithm is an unambiguous specification of how to solve a problem through a sequence of logical and numerical steps. More generally, the term covers any systematic method of numerical calculation, or any route laid out for solving a problem or achieving a goal, used in these fields. Tasks such as calculation, data processing and automated reasoning can be carried out through algorithms. The concept of the algorithm has existed for centuries. Its modern formalization began with attempts to solve what David Hilbert (1928) called the Entscheidungsproblem (decision problem). Subsequent formalizations were framed as attempts to define “effective calculability” or “effective method”: the Gödel–Herbrand–Kleene recursive functions (1930, 1934 and 1935), the lambda calculus of Alonzo Church (1936), “Formulation 1” by Emil Post (1936) and the Turing machines of Alan Turing (1936–37 and 1939). Since then, particularly in the twentieth century, there has been growing interest in data analysis algorithms and in their application to various interdisciplinary datasets.
Data analysis can be defined as the process of collecting raw data and converting it into information that users find useful in their decision-making. Data are collected and analyzed in order to answer questions, test hypotheses or refute theories. According to the statistician John Tukey (1961), data analysis comprises (1) procedures for analyzing data, (2) techniques for interpreting the results of these procedures and (3) methods for planning the gathering of data so as to make its analysis more accurate and easier. It also comprises all the machinery and results of (mathematical) statistics that apply to analyzing data. Algorithms can be classified in numerous ways, each of which has its own merits.
Accordingly, knowledge turns into power when it is processed, analyzed and interpreted properly and accurately. With this motive in mind, our general aim in this book is to integrate relevant findings through an interdisciplinary approach, discussing various relevant methods and thus putting forth a common approach to both problems and solutions. The main aim of this book is to provide readers with core skills in data analysis for interdisciplinary studies. Data analysis is characterized by three typical features: (1) algorithms for classification, clustering, association analysis, modeling and data visualization, as well as for singling out singularities; (2) source code of computer algorithms for conducting data analysis; and (3) the specific fields (economics, physics, medicine, psychology, etc.) in which the data are collected.
This book will help readers build a bridge from equations to the source code of algorithms, and from the interpretation of results to meaningful information about the data and the process they represent. As the algorithms are developed further, the significance of having a variety of variables will become clear. Moreover, the book will show how to use the results of data analysis for forecasting future developments and for diagnosis and prediction in medicine and related fields. In this way, we will present how knowledge merges with applications.
With this concern in mind, the book will serve as a guide for interdisciplinary studies carried out by those working in mathematics, statistics, economics, medicine, engineering, neuroengineering, computer science, neurology, cognitive sciences and psychiatry, among other fields.
In this book, we will analyze in detail important algorithms for data analysis and classification. We will discuss the contributions of linear and multilinear models, decision trees, the naive Bayesian classifier, support vector machines, k-nearest neighbor and artificial neural network (ANN) algorithms. In addition, the book covers fractal and multifractal methods combined with an ANN algorithm.
The main goal of this book is to provide the readers with core skills regarding data analysis in interdisciplinary datasets. The second goal is to analyze each of the main components of data analysis:
–Application of algorithms to real and synthetic datasets
–Specific application of data analysis algorithms to interdisciplinary datasets
–Detailed description of general concepts for extracting knowledge from data, which undergird the wide-ranging array of datasets and application algorithms
Accordingly, each component has adequate resources so that data analysis can be developed through algorithms. This comprehensive collection is organized into three parts:
–Classification of real and synthetic datasets by algorithms
–Singling out singularity features with fractals and multifractals for real and synthetic datasets
–Achieving a high accuracy rate in classifying the singled-out singularity features with an ANN algorithm (the learning vector quantization (LVQ) algorithm is one of the ANN algorithms).
Moreover, we aim to bring together three scientific endeavors and pave the way for future applications to
–real and synthetic datasets,
–fractals and multifractals for singled-out singularity data obtained from real and synthetic datasets and
–data analysis algorithms for the classification of datasets.
Main objectives are as follows:
Our book intends to enhance knowledge and facilitate learning through linear and multilinear models, decision trees, the naive Bayesian classifier, support vector machines, k-nearest neighbor and ANN algorithms, as well as fractal and multifractal methods combined with ANN, with the following goals:
–Understand what data analysis means and how data analysis can be employed to solve real problems through the use of computational mathematics
–Recognize whether an algorithmic data analysis solution is a feasible alternative for a specific problem
–Draw inferences on the results of a given algorithm through discovery process
–Apply relevant mathematical rules and statistical techniques to evaluate the results of a given algorithm
–Recognize several different computational mathematics techniques for data analysis strategies and optimize the results by selecting the most appropriate strategy
–Develop a comprehensive understanding of how different data analysis techniques build models to solve problems related to decision-making, classification, selection of the most significant attributes from datasets and so on
–Understand the types of problems that can be solved by combining an expert-system problem-solving approach with a data analysis strategy
–Develop a general awareness of the structure of a dataset and of how a dataset can be used to enhance opportunities in different fields, including but not limited to psychiatry, neurology (radiology) and economics
–Understand how data analysis through computational mathematics can be applied to algorithms via concrete examples whose procedures are explained in depth
–Handle independent variables that are directly correlated with the dependent variable
–Learn how to use a decision tree to design a rule-based system
–Calculate the probability that a sample with given attributes belongs to each class in the dataset
–Determine which training samples correspond to the k smallest entries of the computed distance vector
–Specify significant singled out singularities in data
–Know how to implement codes and use them in accordance with computational mathematical principles
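The k-nearest neighbor objective in the list above (finding the training samples with the k smallest distances) can be sketched as follows. This is a minimal illustration rather than the book's implementation: the book's applications are written in Matlab, Python is used here only as an assumption for brevity, and the function name `knn_predict` and the toy data are hypothetical.

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training samples."""
    # Distance vector: Euclidean distance from the query to every training sample.
    distances = [math.dist(x, query) for x in train_X]
    # Indices of the k smallest entries of the distance vector.
    nearest = sorted(range(len(distances)), key=distances.__getitem__)[:k]
    # Majority vote over the labels of those k samples.
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0], nearest

# Toy data invented for illustration only (not taken from the book's datasets).
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (4.0, 4.2), (4.1, 3.9)]
train_y = ["A", "A", "A", "B", "B"]
label, nearest = knn_predict(train_X, train_y, (1.1, 1.0), k=3)
print(label, sorted(nearest))
```

The choice of Euclidean distance and a simple majority vote is one common design; other distance measures or weighted votes fit the same structure.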
Our intended audience is undergraduate, graduate and postgraduate students as well as academics and scholars; however, it also encompasses a wider range of readers who specialize in, or are interested in, the applications of data analysis to real-world problems in various fields, such as engineering, medical studies, mathematics, physics, social sciences and economics. The purpose of the book is to provide readers with the mathematical foundations for some of the main computational approaches to data analysis, decision-making, classification and the selection of significant attributes. These include techniques and methods for numerical solution of systems of linear and nonlinear algorithms. This requires making connections between techniques of numerical analysis and algorithms. The content of the book focuses on presenting the main algorithmic approaches and the underlying mathematical concepts, with particular attention given to implementation aspects. Hence, typical mathematical environments, namely Matlab and the available solvers and libraries, are used throughout the chapters.
In writing this text, we directed our attention toward three groups of individuals:
–Academics who wish to teach a unit or conduct a workshop or an entire course on essential computational mathematical approaches. They will gain insight into the importance of datasets so that they can decide which algorithm to use for the datasets available. Thus, they will have a guided roadmap for solving the relevant problems.
–Students who wish to learn more about essential computational mathematical approaches to datasets, learn how to write an algorithm and thus acquire practical experience in using their own datasets.
–Professionals who need to understand how the essentials of computational mathematics can be applied to solve issues related to their business, by learning more about the attributes of their data, interpreting them more easily and meaningfully, and selecting the most critical attributes, which will help them sort out their problems and guide their future direction.
In writing this book, we have adopted the approach of learning by doing. We believe the book offers a distinctive teaching and learning experience. This view is supported by several features of the text, as follows:
–Plain and detailed examples. We provide simple but detailed examples and applications of how the various algorithm techniques can be used to build the models.
–Overall tutorial style. The book offers flowcharts, algorithms and examples from Chapter 4 onward so that the readers can follow the instructions and carry out their applications on their own.
–Algorithm sessions. Algorithm chapters allow readers to work through the steps of data analysis and of the problem-solving process with the provided mathematical computations. Each chapter is specially highlighted so that it can be easily distinguished from regular text.
–Datasets for data analysis. A variety of datasets from medicine and economics are available for computational mathematics-oriented data analysis.
–Websites for data analysis. Links to several websites that host datasets are provided:
UCI Machine Learning Repository
(http://archive.ics.uci.edu/ml/index.php)
Kaggle the Home of Data Science & Machine Learning
(http://kaggle.com)
The World Bank Data Bank
(http://data.worldbank.org)
–Data analysis tools. All the applications in the book have been carried out by Matlab software.
–Key term definitions. Each chapter introduces several key terms that are critical to understanding the concepts. A list of definitions for these terms is provided at the head of each chapter.
–Examples and applications in each chapter. Each chapter includes examples related to the subject matter and applications relevant to the datasets being handled. These are grouped into two categories: computational questions, and applications to the medical datasets and the economy dataset.
–Computational questions have a mathematical flavor in that they require the reader to perform one or several calculations. Many of the computational questions are suitably challenging for advanced students.
–Medical datasets and economy dataset applications discuss advanced methods for classification and learning algorithms (linear and multilinear models; the decision tree algorithms ID3, C4.5 and CART; naive Bayes; support vector machines; k-nearest neighbor; ANN), as well as multifractal methods combined with the LVQ algorithm, which is one of the ANN algorithms.
Examples and applications on the medical datasets and the economy dataset, using several algorithms, are explained concisely and analyzed in this book.
When data analysis with algorithms is not a viable choice, other options for creating useful decision-making models may be available based on the scheme in Figure 1.1.
A brief description of the contents in each chapter is as follows:
–Chapter 2 offers an overview of all aspects of the processes related to dataset as well as basic statistical descriptions of dataset along with relevant examples. It also addresses explanations of attributes of datasets to be handled in the book.
–Chapter 3 introduces data preprocessing and model evaluation. It first introduces the concept of data quality and then discusses methods for data cleaning, data integration, data reduction and data transformation.
–Chapter 4 provides detailed information on the definition of an algorithm. It also elaborates on basic concepts related to algorithms: flowcharts, variables, loops and expressions.
–Chapter 5 presents some basic concepts regarding linear model and multilinear model with details on our datasets (medicine and economics).
–Chapter 6 provides basic concepts and methods for classification by decision tree induction, including the ID3, C4.5 and CART algorithms, together with rule-based classification. It also covers the application of the ID3, C4.5 and CART models to the medical datasets and the economy dataset.
–Chapter 7 introduces some basic concepts and methods for probabilistic naive Bayes classifiers, which are based on Bayes’ theorem with strong (naive) independence assumptions between the features.
–Chapter 8 explains the principles on support vector machines modeling, introducing the basic concepts and methods for classification in a variety of applications that are linearly separable and linearly inseparable.
–Chapter 9 elaborates on the nonparametric k-nearest neighbor algorithm and how it is used for classification and regression. The algorithm stores all available cases and classifies new cases based on a similarity measure. In the chapter, it is applied to the medical datasets and the economy dataset, predicting a property of a sample from its most similar samples.
–Chapter 10 presents two popular neural network models that are supervised and unsupervised, respectively. This chapter provides a detailed explanation of feedforward backpropagation and LVQ algorithms’ network training topologies/structures. Detailed examples have been provided, and applications on these neural network architectures are given to show their weighted connections during training with medical datasets and economy dataset.
–Chapter 11 emphasizes the essential concepts pertaining to fractals and multifractals. Among the fractal methods, it highlights fractal dimension and diffusion-limited aggregation, with specific applications to the medical datasets and the economy dataset. Brownian motion analysis, one of the multifractal methods, is used to select singled-out singularities in these datasets. The chapter applies Brownian motion Hölder regularity functions (polynomial and exponential) to the medical datasets and the economy dataset, with the aim of singling out the singularity attributes in such datasets. By applying the LVQ algorithm to the singularity attributes obtained in this chapter, we propose a corresponding classification.
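To make the “naive” independence assumption mentioned for Chapter 7 concrete, the following minimal sketch computes class probabilities for categorical attributes as the prior times a product of per-attribute likelihoods. It is an illustration under stated assumptions, not the book's code: the book works in Matlab, Python is used here for brevity, Laplace (add-one) smoothing is an added choice, and the function names and toy samples are hypothetical.

```python
from collections import Counter, defaultdict

def train_naive_bayes(samples, labels):
    """Count classes and per-class attribute values from categorical data."""
    class_counts = Counter(labels)
    feature_counts = defaultdict(Counter)  # (class, attribute index) -> value counts
    values = defaultdict(set)              # attribute index -> set of observed values
    for x, y in zip(samples, labels):
        for j, v in enumerate(x):
            feature_counts[(y, j)][v] += 1
            values[j].add(v)
    return class_counts, feature_counts, values

def posterior(model, x):
    """Return P(class | x) under the naive independence assumption."""
    class_counts, feature_counts, values = model
    total = sum(class_counts.values())
    scores = {}
    for c, n_c in class_counts.items():
        p = n_c / total  # prior P(c)
        for j, v in enumerate(x):
            # Laplace-smoothed likelihood P(x_j = v | c); smoothing is an assumption here.
            p *= (feature_counts[(c, j)][v] + 1) / (n_c + len(values[j]))
        scores[c] = p
    z = sum(scores.values())  # normalize so the probabilities sum to 1
    return {c: s / z for c, s in scores.items()}

# Toy categorical samples invented for illustration only.
samples = [("high", "yes"), ("high", "no"), ("low", "no"), ("low", "no"), ("high", "yes")]
labels = ["pos", "pos", "neg", "neg", "pos"]
model = train_naive_bayes(samples, labels)
probs = posterior(model, ("high", "yes"))
print(max(probs, key=probs.get), probs)
```

Multiplying the per-attribute likelihoods is exactly where the independence assumption enters; without it, the joint likelihood of all attributes would have to be estimated directly.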
All the applications in the book have been carried out in Matlab, using the datasets described below.
We focus on issues related to the economy dataset and the medical datasets. There are three different datasets for this focus, made up of two synthetic datasets and one real dataset.
Economy (U.N.I.S.) dataset. This dataset contains real data about economies of the following countries: The United States, New Zealand, Italy and Sweden collected from 1960 to 2015. The dataset has both numeric and nominal attributes.
The dataset has been retrieved from http://databank.worldbank.org/data/home.aspx. Databank is an online resource that provides quick and simple access to collections of real time series data.
The multiple sclerosis (MS) dataset. MS dataset holds magnetic resonance imaging and Expanded Disability Status Scale information about relapsing remitting MS, secondary progressive MS, primary progressive MS patients and healthy individuals. MS dataset is obtained from Hacettepe University Medical Faculty Neurology Department, Ankara, Turkey. Data for the clinically definite MS patients were provided by Rana Karabudak, Professor (MD), the Director of Neuroimmunology Unit in Neurology Department.
To obtain the synthetic MS dataset, the classification and regression trees (CART) algorithm was applied to the real MS dataset. There are two ways to obtain a synthetic dataset from a real one: one is to develop an algorithm for the purpose and the other is to use available software (SynTReN, R-package, ToXene, etc.).
Clinical psychology dataset (revised Wechsler Adult Intelligence Scale [WAIS-R]). The WAIS-R test assesses adult intelligence. The real WAIS-R dataset was obtained through Majaz Moonis, Professor (M.D.) at the University of Massachusetts (UMASS), Department of Neurology and Psychiatry, where he is the Stroke Prevention Clinic Director.
To obtain the synthetic WAIS-R dataset, the classification and regression trees (CART) algorithm was applied to the real WAIS-R dataset. There are two ways to obtain a synthetic dataset from a real one: one is to develop an algorithm for the purpose and the other is to use available software (SynTReN, R-package, ToXene, etc.).