
1. Introduction to Data Science

Data is a collection of information about a subject in the form of words, numbers, and descriptions. Consider the following statement: “The dog has four legs, is 1.5 m high, and has brown hair.” This statement contains three different pieces of information (i.e., data) about the dog. The values “four” and “1.5 m” are numerical data, whereas “brown hair” is descriptive. Knowing the various kinds of data types helps you understand the data, perform effective analysis, and better extract knowledge from it. Basically, data can be categorized into two types:
  • Quantitative data

  • Qualitative data

Quantitative data is obtained through measurement rather than observation and is represented as numerical values. It can be further classified into discrete and continuous data: discrete data takes exact integer values, whereas continuous data can take any value within a range. Qualitative data describes the characteristics of a subject; it is usually obtained through observation and cannot be measured. In other words, qualitative data may be described as categorical data, and quantitative data may be called numerical data.

For example, in the previous statement, “brown hair” describes a characteristic of the dog and is qualitative data, whereas “four legs” and “1.5 m” are quantitative data, categorized as discrete and continuous data, respectively.

Data can be available in structured or unstructured form. When the data is organized in a predefined data model/structure, it is called structured data. Structured data can be stored in a tabular format or in a relational database and queried with the help of query languages. We can also store this kind of data in an Excel file format, like the student database given in Table 1-1.
Table 1-1

An Example of Structured Data

Student Roll Number   Marks     Attendance   Batch       Sex
111401                492/500   98%          2011-2014   Male
111402                442/500   72%          2011-2014   Male
121501                465/500   82%          2012-2015   Female
121502                452/500   87%          2012-2015   Male

Most human-generated and machine-generated data are unstructured data such as emails, documents, text files, log files, text messages, images, video and audio files, messages on the Web and social media, and data from sensors. This data can be converted to a structured format only through human or machine intervention. Figure 1-1 shows the various sources of unstructured data.
Figure 1-1 Sources of unstructured data

Importance of Data Types in Data Science

Before starting to analyze data, it is important to know the data types involved so that you can choose suitable analysis methods. The analysis of continuous data is different from the analysis of categorical data; hence, using the same analysis methods for both may lead to incorrect results.

For example, in statistical analysis involving continuous data, the probability of observing any single exact value is zero, whereas for discrete data an exact value can have a nonzero probability.
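To make this concrete, here is a minimal sketch using SciPy (an assumed library choice; it is not introduced in this chapter) contrasting a continuous normal distribution with a discrete Poisson distribution:

from scipy.stats import norm, poisson

# Continuous: a normal variable has zero probability of equaling any
# exact value; only intervals carry nonzero probability.
p_interval = norm.cdf(1.6) - norm.cdf(1.4)   # P(1.4 <= X <= 1.6) > 0
# P(X == 1.5) is exactly 0 for a continuous distribution.

# Discrete: a Poisson variable assigns nonzero probability to exact counts.
p_exact = poisson.pmf(3, mu=2)               # P(X == 3) when the mean is 2

print(p_interval, p_exact)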

You can also choose the visualization tools based on the data types. For instance, continuous data is usually represented using histograms, whereas discrete data can be visualized with the help of bar charts.
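As a minimal illustration of this choice, the following sketch uses Matplotlib (an assumed library; any plotting package would do) to draw a histogram for a continuous variable and a bar chart for a categorical one; the numbers are made up:

import matplotlib.pyplot as plt
import numpy as np

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Continuous data (e.g., dog heights in meters): a histogram bins a range.
heights = np.random.normal(loc=1.5, scale=0.2, size=200)
ax1.hist(heights, bins=15)
ax1.set_title("Continuous: histogram")

# Categorical data (e.g., hair color counts): one bar per category.
ax2.bar(["brown", "black", "white"], [12, 7, 4])
ax2.set_title("Categorical: bar chart")

plt.show()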

Data Science: An Overview

As discussed at the beginning of the chapter, data science is the extraction of knowledge or information from data. Unfortunately, not all data gives useful information; what can be extracted depends on the client requirements, the hypothesis, the nature of the data types, and the methods used for analysis and modeling. Therefore, a few processes are required before analyzing or modeling the data for intelligent decision-making. Figure 1-2 describes these data science processes.
Figure 1-2 Data science process

Data Requirements

To develop a data science project, data scientists first understand the problem based on the client/business requirements and then define the objectives of the analysis. For example, say a client wants to analyze people’s sentiment about a government policy. First, the objective can be set as “to collect people’s opinions about the government policy.” Then the data scientists decide what kind of data can support the objective and identify the sources of that data. For this example problem, the possible data is social media data, including text messages and opinion polls of various categories of people, with information about their education level, age, occupation, etc. Before starting the data collection, a good work plan is essential for collecting the data from the various sources. Setting the objectives and the work plan reduces the time spent collecting the data and helps in preparing the report.

Data Acquisition

Many types of structured open data are available on the internet; we call this secondary data because it was collected by somebody else and structured into some tabular format. Data that the user collects directly from a source is called primary data. Initially, unstructured data is collected from many sources such as mobile devices, emails, sensors, cameras, direct interaction with people, video files, audio files, text messages, blogs, etc.
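As a small sketch of working with secondary data, the following code loads an already structured data set with pandas; the file name students.csv is hypothetical, standing in for a file with the columns of Table 1-1. Primary data collection on the Raspberry Pi, by contrast, reads directly from sensors, as sketched at the end of this chapter.

import pandas as pd

# Secondary data: a CSV file that somebody else has already structured.
# "students.csv" is a hypothetical file with the columns of Table 1-1.
df = pd.read_csv("students.csv")
print(df.head())    # first few rows
print(df.shape)     # (number of rows, number of columns)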

Data Preparation

Data preparation is the most important part of the data science process. Preparing the data puts the data into proper form for knowledge extraction. There are three steps in the data preparation stage.
  1. Data processing

  2. Data cleaning

  3. Data transformation

Data Processing

This step is important because the quality of the data must be checked while importing it from various sources. Quality checking ensures that the data is of the correct data type, is in a standard format, and has no typos or errors in the variables. This step reduces data issues during analysis. Moreover, in this phase, the collected unstructured data can be organized into structured form for analysis and visualization.
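A minimal sketch of such quality checks with pandas (again assuming the hypothetical students.csv file): verify the data type inferred for each column and coerce values that should be numeric, so typos surface as missing values instead of hiding as text.

import pandas as pd

df = pd.read_csv("students.csv")

# Check the data type inferred for each column.
print(df.dtypes)

# Coerce a column that should be numeric; entries that do not parse
# (e.g., typos) become NaN instead of silently remaining strings.
df["Attendance"] = pd.to_numeric(
    df["Attendance"].astype(str).str.rstrip("%"), errors="coerce"
)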

Data Cleaning

Once the data processing is done, cleaning the data is required as the data might still have some errors. These errors will affect the actual information present in the data. Possible errors are as follows:
  • Duplicates

  • Human or machine errors

  • Missing values

  • Outliers

  • Inappropriate values

Duplicates

In the database, some data is repeated multiple times, which results in duplicates. It is better to check and remove the duplicates to reduce the overhead in computation during data analysis.
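In pandas, for example, duplicate rows can be counted and removed in one step; this small sketch builds a toy DataFrame with one repeated row:

import pandas as pd

df = pd.DataFrame({"roll": [111401, 111402, 111402],
                   "marks": [492, 442, 442]})

# Count fully duplicated rows, then drop them, keeping the first occurrence.
print(df.duplicated().sum())          # -> 1
df = df.drop_duplicates(keep="first")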

Human or Machine Errors

The data is collected from sources either by humans or by machines. Some errors are inevitable during this process due to human carelessness or machine failure. A possible way to catch these kinds of errors is to match the variables and their values against standard ones.
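One way to implement such a check in pandas is to compare a column against its set of allowed (standard) values and flag anything outside it; the column and values here are illustrative:

import pandas as pd

df = pd.DataFrame({"Sex": ["Male", "Female", "Mle", "Male"]})  # "Mle" is a typo

allowed = {"Male", "Female"}
bad_rows = df[~df["Sex"].isin(allowed)]
print(bad_rows)   # rows whose value matches no standard value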

Missing Values

While converting unstructured data into a structured form, some rows and columns may be left without values (i.e., empty). These missing values cause discontinuities in the information and make it difficult to visualize. Most programming languages provide built-in functions for checking whether the data has any missing values.
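In pandas, for instance, missing values can be counted per column and then either filled or dropped; which strategy is appropriate depends on the data, so the sketch below shows both:

import pandas as pd
import numpy as np

df = pd.DataFrame({"marks": [492, np.nan, 465],
                   "attendance": [98, 72, np.nan]})

print(df.isnull().sum())            # missing values per column
df_filled = df.fillna(df.mean())    # option 1: fill with the column mean
df_dropped = df.dropna()            # option 2: drop incomplete rows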

Outliers

In statistics, an outlier is a data point that differs significantly from the other observations. An outlier may be due to variability in the measurement, or it may indicate an experimental error; outliers are sometimes excluded from the data set. Figure 1-3 shows an example of outlier data. Outliers can cause problems with certain types of models, which in turn will influence the decision-making.
Figure 1-3 Outlier data
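A common rule of thumb, sketched below, flags points lying more than 1.5 times the interquartile range (IQR) beyond the quartiles; this is just one of several possible criteria:

import numpy as np

data = np.array([10, 12, 11, 13, 12, 11, 95])   # 95 looks like an outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)   # -> [95]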

Transforming the Data

Data transformation puts the cleaned data on a common scale or into a common representation so that variables can be compared and modeled fairly. It can be done by many methods, such as normalization, min-max scaling, and the use of correlation information.
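For example, min-max scaling maps each value x to (x - min) / (max - min), so every value falls in the range [0, 1]; a minimal NumPy sketch:

import numpy as np

marks = np.array([492, 442, 465, 452], dtype=float)

# Min-max scaling: (x - min) / (max - min) maps values into [0, 1].
scaled = (marks - marks.min()) / (marks.max() - marks.min())
print(scaled)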

Data Visualization

Based on the requirements of the user, the data can be analyzed with the help of visualization tools such as charts, graphs, etc. These visualization tools help people to understand the trends, variations, and deviations in a particular variable in the data set. Visualization techniques can be used as a part of exploratory data analysis.

Data Analysis

The data can be further analyzed with the help of mathematical techniques, particularly statistical ones. The improvements, deviations, and variations are determined in numerical form. We can also generate an analysis report by combining the results of the visualization tools and the analysis techniques.
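As a small sketch of such numerical analysis, pandas can compute standard descriptive statistics in one call and measure the correlation between two columns:

import pandas as pd

df = pd.DataFrame({"marks": [492, 442, 465, 452],
                   "attendance": [98, 72, 82, 87]})

print(df.describe())                        # count, mean, std, quartiles, etc.
print(df["marks"].corr(df["attendance"]))   # correlation between two columns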

Modeling and Algorithms

Today many machine learning algorithms are employed to predict useful information from raw data. For example, a neural network can be used to identify the users who are willing to donate funds to orphans based on the users’ previous behavior. In this scenario, the previous behavior data of users can be collected based on their education, activities, occupation, sex, etc. The neural network can be trained with this collected data. Whenever a new user’s data is fed to this model, it can predict whether the new user is likely to donate. However, the accuracy of the prediction depends on the reliability and the amount of the data used for training.

There are many machine learning algorithms available, such as regression techniques, support vector machines (SVMs), neural networks, deep neural networks, recurrent neural networks, etc., that can be applied to data modeling. After modeling, the model can be evaluated by giving it data from new users and developing a prediction report.
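A minimal sketch of this train-then-predict workflow with scikit-learn (an assumed library choice), using a toy stand-in for the donor example with hypothetical numeric features:

from sklearn.linear_model import LogisticRegression

# Each row describes a user with hypothetical encoded features
# (e.g., education level, age); 1 = donated previously, 0 = did not.
X_train = [[1, 25], [2, 34], [3, 41], [1, 52], [3, 29], [2, 60]]
y_train = [0, 0, 1, 1, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)

# Predict whether a new user is likely to donate.
print(model.predict([[3, 35]]))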

Report Generation/Decision-Making

Finally, a report can be developed based on the analysis with the help of visualization tools, mathematical or statistical techniques, and models. Such reports are helpful in many circumstances, such as forecasting the strengths and weaknesses of an organization, industry, government, etc. The facts and findings from the report make decisions easier and more informed. Moreover, the analysis report can be generated automatically using automation tools based on the client requirements.

Recent Trends in Data Science

Certain fields in data science are growing exponentially and therefore will be attractive to data scientists. They are discussed in the following sections.

Automation in Data Science

In the current scenario, data science still requires a lot of manual work, such as data processing, data cleaning, and data transformation. These steps consume a lot of time and computation. The modern world demands the automation of data science processes such as data processing, data cleaning, data transformation, analysis, visualization, and report generation. Hence, automation will be in top demand in the data science industry.

Artificial Intelligence–Based Data Analyst

Artificial intelligence techniques and machine learning algorithms can be implemented effectively for modeling the data. In particular, reinforcement learning with deep neural networks can be used to improve the model’s learning as the data varies. Machine learning techniques can also be used to automate data science projects.

Cloud Computing

The amount of data used by people nowadays has increased exponentially. Some industries gather a large amount of data every day and hence find it difficult to store and analyze it with local servers, which is expensive in terms of computation and maintenance. So, they prefer cloud computing, in which the data can be stored on cloud servers and retrieved anytime and anywhere for analysis. Many cloud computing companies offer a data analytics platform on their cloud servers. The more data processing grows, the more attention this field will gain.

Edge Computing

Many small-scale industries don’t require the analysis of data on cloud servers but instead need analysis reports instantly. For these kinds of applications, edge devices are a possible solution: they can acquire the data, analyze it, and present a report in visual or numerical form to the users instantly. In the future, the requirements of edge computing will increase significantly.

Natural Language Processing

Natural language processing (NLP) can be used to extract unstructured data from websites, emails, servers, log files, etc. In addition, NLP can be useful for converting text into a single data format. For example, we can convert people’s emotions, as expressed in their social media messages, into a data format. This makes NLP a powerful tool for collecting data from many sources, and its demand will continue to increase.
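As a hedged illustration, NLTK’s VADER sentiment analyzer can turn a message into numeric sentiment scores (this assumes NLTK is installed and the vader_lexicon resource is downloaded; other NLP libraries would work equally well):

import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon")    # one-time download of the sentiment lexicon

sia = SentimentIntensityAnalyzer()
scores = sia.polarity_scores("I really like the new policy!")
print(scores)   # dictionary with neg/neu/pos/compound scores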

Why Data Science on the Raspberry Pi?

Many books explain the different processes involved in data science in relation to cloud computing. In this book, however, the concepts of data science are discussed as part of real-time applications using the Raspberry Pi. Raspberry Pi boards can interact with the real world by connecting to a wide range of sensors through their general-purpose input/output (GPIO) pins, which makes it easy to collect real-time data. Owing to their small size and low cost, a number of Raspberry Pi boards can be connected as a network, enabling localized operation. In other words, the Raspberry Pi can be used as an edge computing device, performing data processing and storage close to the devices that acquire the information and thereby overcoming the disadvantages associated with cloud computing. Therefore, many data processing applications can be implemented using a distributed set of these devices that manage real-time data and run the analytics locally. This book will help you implement real-time data science applications using the Raspberry Pi.
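As a first taste, the sketch below logs the Raspberry Pi’s own CPU temperature, which Raspberry Pi OS exposes at a standard sysfs path, once per second; reading an external sensor over the GPIO pins follows the same acquire-and-record pattern:

import time

def read_cpu_temp_c():
    # Raspberry Pi OS reports the CPU temperature in millidegrees Celsius.
    with open("/sys/class/thermal/thermal_zone0/temp") as f:
        return int(f.read().strip()) / 1000.0

# Acquire one reading per second for ten seconds and record the values.
readings = []
for _ in range(10):
    readings.append(read_cpu_temp_c())
    time.sleep(1)

print(readings)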
