Preface

This book is about data science, one of the fastest-growing fields, with applications in almost all disciplines. It provides a comprehensive overview of the field.

Data science is a data-driven approach to decision making that draws on several areas, methods, algorithms, models, and disciplines to extract insights and knowledge from structured and unstructured data. These insights guide the application of algorithms and models to make decisions. The models in data science are used in predictive analytics to predict future outcomes. Machine learning and artificial intelligence (AI) are major application areas of data science.

Data science is a multidisciplinary field that provides the knowledge and skills to understand, process, and visualize data in the initial stages, followed by applications of statistics, modeling, mathematics, and technology to solve analytically complex problems using structured and unstructured data. At the core of data science is data. It is about using this data in creative and effective ways to help businesses make data-driven decisions, and about extracting knowledge and insights from data. Businesses and processes today are run using data. Data are now collected on such a massive scale that the present period is often referred to as the age of Big Data. Rapid advances in technology make it possible to collect, store, and process large volumes of data quickly. Visualization, statistical analysis, and modeling tools help businesses use this data effectively to drive their decisions.

Companies now collect massive amounts of structured and unstructured data, on the scale of exabytes to zettabytes. Advances in technology, computing capability, and storage have made it possible to process and analyze data at this scale.

The field of data science is vast and has a wide scope. The terms data science, data analytics, business analytics, and business intelligence are often used interchangeably, even by professionals in these fields. All these areas are related, with data science having the largest scope. This book outlines the tools, techniques, and applications of data science and explains how the field resembles and differs from data analytics, analytics, business analytics, and business intelligence.

The knowledge of statistics in data science is as important as the applications of computer science. Statistics is the science of data and variation. Data analysis and statistical analysis constitute major applications of data science. Therefore, a significant part of this book emphasizes the statistical concepts needed to apply data science in the real world. It provides a solid foundation in statistics applied to data science. Data visualization and other descriptive and inferential tools, knowledge of which is critical for data science professionals, are discussed in detail. The book also introduces the basics of machine learning, which is now a major part of data science, and the statistical programming language R, which is widely used by data scientists. A chapter-by-chapter synopsis follows.

Chapter 1 provides an overview of data science by defining it and outlining its tools and techniques. It describes the differences and similarities between data science and data analytics. This chapter also discusses the role of statistics in data science, a brief history of data science, the knowledge and skills needed by data science professionals, and a broad view of data science with associated areas. The body of knowledge essential for data science and the different tools and technologies used in data science are also part of this chapter. Finally, the chapter looks at the future outlook of data science as a field and the career path for data scientists. The major topics discussed in Chapter 1 are: (a) broad view of data science with associated areas, (b) data science body of knowledge, (c) technologies used in data science, (d) future outlook, and (e) career path for data science professionals and data scientists.

Other concepts related to data science, including analytics, business analytics, and business intelligence (BI), are discussed in subsequent chapters. Data science continues to evolve as one of the areas most sought after by companies. The job outlook for this area continues to be among the strongest of all fields.

Chapter 2 covers analytics and business analytics, which together form one of the major areas of data science. These terms are often used interchangeably with data science. We outline the differences between them and explain the different types of analytics and the tools used in each one. The decision-making process in data science makes heavy use of analytics and business analytics tools, which are integral parts of data analysis. We, therefore, felt it necessary to explain and describe the role of analytics in data science. Analytics is the science of analysis: the processes by which we analyze data, draw conclusions, and make decisions. Business analytics (BA) covers a vast area. It is a complex field that encompasses visualization, statistics and modeling, optimization, simulation-based modeling, and statistical analysis. It uses descriptive, predictive, and prescriptive analytics, including text and speech analytics, web analytics, and other application-based analytics. This chapter also discusses different predictive models and predictive analytics. Flow diagrams outlining the tools of descriptive, predictive, and prescriptive analytics are presented in this chapter. The decision-making tools in analytics are part of data science.

Chapter 3 draws a comparison between business intelligence (BI) and business analytics (BA). Business analytics, data, analytics, and advanced analytics fall under the broad area of BI. The broad scope of BI and the distinction between BI and BA tools are outlined in this chapter.

Chapter 4 is devoted to the collection, presentation, and classification of data. Data science is about the study of data. Data are of various types and are collected using different means. This chapter explains the types of data and their classification with examples. Companies collect massive amounts of data. The volume of data collected and analyzed by businesses is so large that it is referred to as “Big Data.” The volume, variety, and speed (velocity) with which data are collected require specialized tools and techniques, including specially designed big data software for analysis.

In Chapter 5, we introduce Excel, a widely available and widely used software package for data visualization and analysis. A number of graphs and charts are presented with stepwise instructions. Several packages are available as add-ins to Excel to enhance its capabilities. The chapter presents basic to more involved features and capabilities. It is divided into sections, beginning with “Getting Started with Excel” and followed by several applications, including formatting data as a table, filtering and sorting data, and simple calculations. Other applications in this chapter are analyzing data using pivot tables and pivot charts; descriptive statistics using Excel; visualizing data using Excel charts and graphs; visualizing categorical data with bar charts, pie charts, and cross tabulation; and exploring the relationship between two and three variables with scatter plots, bubble graphs, and time-series plots. Excel is a very widely used application in data science.

Chapters 6 and 7 deal with the basics of statistical analysis for data science. Statistics, data analysis, and analytics are at the core of data science applications. Statistics involves making decisions from data. Making effective decisions using statistical methods and data requires an understanding of three areas of statistics: (1) descriptive statistics, (2) probability and probability distributions, and (3) inferential statistics. Descriptive statistics involves describing the data using graphical and numerical methods. These methods are used to create visual representations of the variables or data and to calculate various statistics that describe the data. Graphical tools are also helpful in identifying patterns in the data. These chapters discuss data visualization tools, and a number of graphical techniques are explained with their applications.

There has been increasing pressure on businesses to provide high-quality products and services. This is critical to improving their market share in a highly competitive market. Not only is it critical for businesses to meet and exceed customer needs and requirements, it is also important for them to process and analyze large amounts of data (in real time, in many cases). Visualizing, processing, and analyzing data, and using it effectively, are needed to make timely data-driven business decisions. The processing and analysis of large data sets fall under the emerging fields of big data, data mining, and analytics.

To process these massive amounts of data, data mining uses statistical techniques and algorithms to extract nontrivial, implicit, previously unknown, and potentially useful patterns. Because applications of data mining tools are growing, there will be more demand for professionals trained in data science and analytics. The discovery of knowledge from this data in order to make intelligent data-driven decisions is referred to as business intelligence (BI) and business analytics. These are hot topics in business and leadership circles today because they use a set of techniques and processes that aid fact-based decision making. These concepts are discussed in various chapters of the book.

Many of the data analysis and statistical techniques we discuss in Chapters 6 and 7 are prerequisites to fully understanding data science and business analytics.

In Chapter 8, we discuss numerical methods that describe several measures critical to data science and analysis. These measures are known as statistics when they are calculated from sample data. We explain the measures of central tendency, measures of position, and measures of variation. We also discuss the empirical rule, which relates the mean and standard deviation and aids in understanding what it means for data to be normally distributed. Finally, in this chapter, we study the statistics that measure the association between two variables: the covariance and the correlation coefficient. All these measures, along with the visual tools, are essential parts of data analysis.
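
As a rough illustration of the measures described for Chapter 8, the following Python sketch computes a mean, median, sample standard deviation, covariance, and correlation coefficient using only the standard library. The data values and the helper names sample_covariance and correlation are hypothetical and are not taken from the book, which relies on Excel and R for its own examples. The last lines check the empirical rule against a standard normal distribution.

```python
import statistics
from statistics import NormalDist

# Hypothetical paired observations, invented for illustration only.
hours = [2.0, 3.5, 5.0, 6.5, 8.0]
scores = [52.0, 60.0, 71.0, 80.0, 88.0]


def sample_covariance(x, y):
    """Sample covariance: sum of products of deviations divided by n - 1."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    return sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / (len(x) - 1)


def correlation(x, y):
    """Pearson correlation: covariance scaled by both sample standard deviations."""
    return sample_covariance(x, y) / (statistics.stdev(x) * statistics.stdev(y))


# Measures of central tendency and variation for one variable.
mean_hours = statistics.mean(hours)
median_hours = statistics.median(hours)
stdev_hours = statistics.stdev(hours)

# Measures of association between the two variables.
cov = sample_covariance(hours, scores)
r = correlation(hours, scores)

# Empirical rule: share of a normal distribution within 1 and 2 standard deviations.
standard_normal = NormalDist(mu=0.0, sigma=1.0)
within_one_sd = standard_normal.cdf(1) - standard_normal.cdf(-1)
within_two_sd = standard_normal.cdf(2) - standard_normal.cdf(-2)

print(f"mean={mean_hours}, median={median_hours}, stdev={stdev_hours:.3f}")
print(f"covariance={cov:.3f}, correlation={r:.4f}")
print(f"within 1 sd: {within_one_sd:.4f}, within 2 sd: {within_two_sd:.4f}")
```

The covariance carries the units of both variables, while the correlation coefficient is unit-free and always lies between -1 and 1, which is why the two are usually reported together.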

In data analytics and data science, probability and probability distributions play an important role in decision making. They are essential to drawing conclusions from data and are used in problems involving inferential statistics. Chapter 9 provides a comprehensive review of probability.

Chapter 10 discusses the concepts of random variables and discrete probability distributions. These distributions play an important role in the decision-making process. Several discrete probability distributions, including the binomial, Poisson, hypergeometric, and geometric distributions, are discussed with applications. The second part of this chapter deals with continuous probability distributions. The emphasis is on the normal distribution, which is perhaps the most important distribution in statistics and data analysis. The normal distribution is the basis of quality programs such as Six Sigma. The chapter also provides a brief explanation of the exponential distribution, which has wide applications in modeling and reliability engineering.
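
To make the distributions named for Chapter 10 a little more concrete, here is a minimal Python sketch that evaluates a binomial probability, a Poisson probability, and a normal tail probability. The function names binomial_pmf and poisson_pmf, and all parameter values, are assumptions chosen for illustration; they are not drawn from the book.

```python
import math
from statistics import NormalDist


def binomial_pmf(k, n, p):
    """P(X = k) for a binomial variable with n trials and success probability p."""
    return math.comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(k, rate):
    """P(X = k) for a Poisson variable with the given mean rate."""
    return math.exp(-rate) * rate**k / math.factorial(k)


# Hypothetical scenario: 10 items inspected, each defective with probability 0.1.
p_two_defects = binomial_pmf(2, 10, 0.1)

# Hypothetical scenario: arrivals average 2 per minute; chance of none in a minute.
p_no_arrivals = poisson_pmf(0, 2.0)

# Hypothetical normal measurement with mean 100 and standard deviation 15.
measurement = NormalDist(mu=100.0, sigma=15.0)
p_above_130 = 1.0 - measurement.cdf(130.0)

print(f"P(exactly 2 defects) = {p_two_defects:.4f}")
print(f"P(no arrivals)       = {p_no_arrivals:.4f}")
print(f"P(value > 130)       = {p_above_130:.4f}")
```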

Chapter 11 introduces the concepts of sampling and sampling distributions. In statistical analysis, we almost always rely on a sample to draw conclusions about the population. The chapter also explains the standard error and the central limit theorem.
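
The ideas of standard error and the central limit theorem can be sketched with a small simulation. The Python example below is only an illustration under stated assumptions: the population is taken to be exponential with mean 1 (a deliberately skewed choice), and the sample size, number of samples, and random seed are arbitrary. It compares the spread of simulated sample means with the theoretical standard error, the population standard deviation divided by the square root of the sample size.

```python
import math
import random
import statistics

random.seed(42)  # fixed seed so repeated runs give the same numbers

# Assumed population: exponential with rate 1, so mean = 1 and standard deviation = 1.
population_mean = 1.0
population_sd = 1.0
sample_size = 50
num_samples = 2000

# Draw many samples and record the mean of each one.
sample_means = []
for _ in range(num_samples):
    sample = [random.expovariate(1.0) for _ in range(sample_size)]
    sample_means.append(statistics.mean(sample))

# Theoretical standard error of the mean versus the simulated spread.
theoretical_se = population_sd / math.sqrt(sample_size)
observed_se = statistics.stdev(sample_means)
center_of_means = statistics.mean(sample_means)

print(f"theoretical standard error: {theoretical_se:.4f}")
print(f"observed spread of means:   {observed_se:.4f}")
print(f"average of sample means:    {center_of_means:.4f}")
```

Even though the individual observations are strongly skewed, the sample means cluster symmetrically around the population mean, which is the behavior the central limit theorem describes.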

Chapter 12 discusses the concepts of estimation, confidence intervals, and hypothesis testing. Sampling theory is important in studying these applications. Samples are used to make inferences about the population, and this is done through the sampling distribution, which is the probability distribution of a sample statistic. We explain the central limit theorem and discuss several examples of formulating and testing hypotheses about the population mean and population proportion. Hypothesis tests are also used in assessing the validity of regression methods. They form the basis of many of the assumptions underlying the regression analysis to be discussed in the coming chapters.
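
A confidence interval and a test about a population mean can be sketched in a few lines of Python. The summary figures below (sample size, sample mean, sample standard deviation, and hypothesized mean) are hypothetical. The sketch uses a large-sample z procedure as a simplifying assumption; with a small sample and an unknown population standard deviation, a t procedure would normally be used instead.

```python
import math
from statistics import NormalDist

# Hypothetical summary statistics from a sample of 64 observations.
n = 64
sample_mean = 50.8
sample_sd = 3.2
hypothesized_mean = 50.0  # null hypothesis: the population mean equals 50
alpha = 0.05

standard_normal = NormalDist()
standard_error = sample_sd / math.sqrt(n)

# Two-sided 95 percent confidence interval for the population mean.
z_critical = standard_normal.inv_cdf(1 - alpha / 2)
ci_low = sample_mean - z_critical * standard_error
ci_high = sample_mean + z_critical * standard_error

# Two-sided z test of the null hypothesis.
z_statistic = (sample_mean - hypothesized_mean) / standard_error
p_value = 2 * (1 - standard_normal.cdf(abs(z_statistic)))
reject_null = p_value < alpha

print(f"95% CI: ({ci_low:.3f}, {ci_high:.3f})")
print(f"z = {z_statistic:.3f}, p-value = {p_value:.4f}, reject null: {reject_null}")
```

The interval and the test agree by construction: the null hypothesis is rejected at the 5 percent level exactly when the hypothesized mean falls outside the 95 percent confidence interval.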

Chapter 13 provides the basics of machine learning. Machine learning is widely used in data science to design systems that can learn, adjust, and improve based on the data fed to them without being explicitly programmed. It is used to create models from the huge amounts of data commonly referred to as big data. Machine learning is closely related to artificial intelligence (AI); in fact, it is an application of AI. Machine learning algorithms are based on teaching a computer how to learn from training data. The algorithms learn and improve as more data flows through the system. Fraud detection, e-mail spam filtering, and GPS systems are some examples of machine learning applications.

Machine learning tasks are typically classified into two broad categories: supervised learning and unsupervised learning. These concepts are described in this chapter.

Finally, in Chapter 14, we introduce the R statistical software. R is a powerful and widely used software environment for data analysis and machine learning applications. This chapter introduces the software and provides its basic statistical features, along with instructions on how to download R and RStudio. The software can be downloaded to run on all major operating systems, including Windows, Mac OS X, and Unix. It is supported by the R Foundation for Statistical Computing. The R programming language was designed for statistical computing and graphics and is widely used by statisticians, data miners, and data science professionals for data analysis. R is perhaps one of the most widely used and powerful programming platforms for statistical programming and applied machine learning. It is widely used for data science and analysis applications and is a desired skill for data science professionals.

The book provides a comprehensive overview of data science and the tools and technology used in this field. Mastery of the concepts in this book is critical in the practice of data science. Data science is a growing field. It continues to evolve as one of the areas most sought after by companies. A career in data science was ranked the third best job in America for 2020 by Glassdoor and was ranked the number one best job from 2016 to 2019. Data scientists have a median salary of $118,370 per year, or $56.91 per hour; actual pay depends on level of education and experience in the field. Job growth in this field is also above average, with a projected increase of 16 percent from 2018 to 2028.

Salt Lake City, Utah, U.S.A.