- Quantitative data
- Qualitative data
Quantitative data is obtained through measurement and is represented as numerical values. It can be further classified as continuous or discrete: discrete data takes exact, countable values, whereas continuous data can take any value within a range. Qualitative data describes the characteristics of a subject; it is usually obtained through observation and cannot be measured directly. In other words, qualitative data may be described as categorical data, and quantitative data as numerical data.
For example, in the previous statement, “brown hair” describes a characteristic of the dog and is qualitative data, whereas “four legs” and “1.5m” are quantitative data, categorized as discrete and continuous data, respectively.
An Example of Structured Data
| Student Roll Number | Marks | Attendance | Batch | Sex |
|---|---|---|---|---|
| 111401 | 492/500 | 98% | 2011-2014 | Male |
| 111402 | 442/500 | 72% | 2011-2014 | Male |
| 121501 | 465/500 | 82% | 2012-2015 | Female |
| 121502 | 452/500 | 87% | 2012-2015 | Male |
Importance of Data Types in Data Science
Before you start analyzing data, it is important to know the data types involved so that you can choose suitable analysis methods. The analysis of continuous data differs from the analysis of categorical data; using the same methods for both may lead to incorrect conclusions.
For example, in statistical analysis involving continuous data, the probability that a variable takes any single exact value is zero, whereas for discrete data that probability can be nonzero.
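This distinction can be made concrete with a small sketch (the specific distributions here are illustrative assumptions, not taken from the text): a fair die assigns nonzero probability to each exact outcome, while for a continuous uniform distribution the probability of any exact value is the area of a zero-width interval.

```python
from fractions import Fraction

# Discrete example: a fair six-sided die.
# The probability of rolling exactly 3 is nonzero.
p_discrete = Fraction(1, 6)

# Continuous example: a uniform distribution on [0, 10].
def uniform_prob(lo, hi, a, b):
    """P(a <= X <= b) for X uniform on [lo, hi]."""
    width = max(0.0, min(b, hi) - max(a, lo))
    return width / (hi - lo)

p_exact = uniform_prob(0, 10, 3, 3)      # zero-width interval -> 0.0
p_range = uniform_prob(0, 10, 2.5, 3.5)  # an interval has mass -> 0.1

print(p_discrete, p_exact, p_range)
```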
You can also choose visualization tools based on the data types. For instance, continuous data is usually represented using histograms, whereas discrete data can be visualized with the help of bar charts.
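As a minimal sketch of this choice (the sample values are invented for illustration), continuous measurements such as heights go in a histogram, while counts per category go in a bar chart:

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

# Continuous data (e.g., heights in metres) -> histogram
heights = [1.52, 1.60, 1.63, 1.70, 1.71, 1.75, 1.80, 1.82]
# Discrete/categorical data (e.g., students per batch) -> bar chart
batches = {"2011-2014": 2, "2012-2015": 2}

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))
ax1.hist(heights, bins=4)
ax1.set_title("Continuous: histogram")
ax2.bar(list(batches.keys()), list(batches.values()))
ax2.set_title("Discrete: bar chart")
fig.savefig("data_types.png")
```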
Data Science: An Overview
Data Requirements
To develop a data science project, data scientists first understand the problem based on the client/business requirements and then define the objectives of the analysis. For example, say a client wants to analyze people's sentiment toward a government policy. First, the objective can be set as “to collect the opinions of the people about the government policy.” Then, the data scientists decide on the kind of data that can support the objective and on the sources of that data. For the example problem, the possible data is social media data, including text messages and opinion polls of various categories of people, with information about their education level, age, occupation, etc. Before starting the data collection, a good work plan is essential for collecting the data from the various sources. Setting the objectives and work plan can reduce the time spent collecting the data and can help in preparing the report.
Data Acquisition
Many types of structured open data are available on the internet; we call this secondary data because it has already been collected by someone else and structured into a tabular format. Data that the user collects directly from a source is called primary data. Initially, unstructured data is collected from many resources such as mobile devices, emails, sensors, cameras, direct interaction with people, video files, audio files, text messages, blogs, etc.
Data Preparation
1. Data processing
2. Data cleaning
3. Data transformation
Data Processing
This step is important because the quality of the data must be checked as it is imported from various sources. The quality check ensures that the data has the correct data types and a standard format and that the variables contain no typos or errors. This reduces data issues during analysis. Moreover, in this phase, the collected unstructured data can be organized into structured data for analysis and visualization.
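A minimal sketch of such quality checks, using pandas on hypothetical student records (the column names and raw values are invented for illustration): coerce each column to its correct type and a standard format.

```python
import pandas as pd

# Hypothetical student records, as they might arrive from a raw import:
# numbers stored as strings, percent signs, inconsistent capitalization.
raw = pd.DataFrame({
    "roll_number": ["111401", "111402", "121501"],
    "attendance": ["98%", "72%", "82%"],
    "sex": ["Male", "male", "FEMALE"],
})

# Enforce correct data types and standard formats.
raw["roll_number"] = raw["roll_number"].astype(int)
raw["attendance"] = raw["attendance"].str.rstrip("%").astype(float) / 100
raw["sex"] = raw["sex"].str.capitalize()

print(raw.dtypes)
print(raw)
```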
Data Cleaning
- Duplicates
- Human or machine errors
- Missing values
- Outliers
- Inappropriate values
Duplicates
In a database, some records are repeated multiple times, which results in duplicates. It is better to check for and remove duplicates to reduce the computational overhead during data analysis.
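With pandas, this check is a one-liner; a small sketch on an invented table with one repeated record:

```python
import pandas as pd

# Hypothetical table in which roll number 111402 appears twice.
df = pd.DataFrame({
    "roll_number": [111401, 111402, 111402, 121501],
    "marks": [492, 442, 442, 465],
})

print(len(df))                  # 4 rows, one of them a duplicate
deduped = df.drop_duplicates()  # keeps the first occurrence of each row
print(len(deduped))             # 3 rows remain
```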
Human or Machine Errors
The data is collected from sources either by humans or by machines. Some errors are inevitable during this process due to human carelessness or machine failure. A possible way to avoid these kinds of errors is to validate the variables and values against standard references.
Missing Values
While converting unstructured data into a structured form, some rows and columns may not have any values (i.e., they are empty). This causes discontinuity in the information and makes it difficult to visualize. Many programming languages provide built-in functions we can use to check whether the data has any missing values.
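In pandas, for instance, `isnull()` locates the gaps; a common remedy (one choice among several, shown here as a sketch on invented values) is to fill them with the column mean:

```python
import numpy as np
import pandas as pd

# Hypothetical table with two missing cells.
df = pd.DataFrame({
    "marks": [492, np.nan, 465, 452],
    "attendance": [0.98, 0.72, np.nan, 0.87],
})

# Count missing values column by column.
print(df.isnull().sum())

# One common remedy: fill each gap with the column mean.
filled = df.fillna(df.mean())
print(filled)
```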
Outliers
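Outliers are values that deviate markedly from the rest of the data and can distort averages and trained models. A common rule of thumb (an assumption here, not a method prescribed by the text) flags values lying more than 1.5 times the interquartile range beyond the quartiles; a minimal pure-Python sketch:

```python
def iqr_outliers(values):
    """Flag values outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR]."""
    s = sorted(values)

    def quantile(q):
        # Simple linear-interpolation quantile.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (pos - lo) * (s[hi] - s[lo])

    q1, q3 = quantile(0.25), quantile(0.75)
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [v for v in values if v < lo_fence or v > hi_fence]

marks = [492, 442, 465, 452, 120]  # 120 looks suspicious
print(iqr_outliers(marks))         # flags 120
```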
Transforming the Data
Data transformation can be done by many methods, such as normalization, min-max scaling, correlation-based methods, etc.
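Min-max scaling, for example, rescales a column linearly into a fixed range such as [0, 1]; a minimal sketch on invented marks:

```python
def min_max_scale(values, new_min=0.0, new_max=1.0):
    """Rescale values linearly into [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:  # avoid division by zero for constant columns
        return [new_min for _ in values]
    return [new_min + (v - lo) * (new_max - new_min) / (hi - lo)
            for v in values]

marks = [492, 442, 465, 452]
print(min_max_scale(marks))  # 442 maps to 0.0 and 492 maps to 1.0
```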
Data Visualization
Based on the requirements of the user, the data can be analyzed with the help of visualization tools such as charts, graphs, etc. These visualization tools help people to understand the trends, variations, and deviations in a particular variable in the data set. Visualization techniques can be used as a part of exploratory data analysis.
Data Analysis
The data can be further analyzed with the help of mathematical and statistical techniques. The improvements, deviations, and variations are determined in numerical form. We can also generate an analysis report by combining the results of the visualization tools and the analysis techniques.
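As a small sketch of such numerical analysis (the attendance figures are invented), Python's standard `statistics` module can summarize a variable and express each value's deviation from the mean:

```python
import statistics

# Hypothetical attendance fractions for four students.
attendance = [0.98, 0.72, 0.82, 0.87]

mean = statistics.mean(attendance)
stdev = statistics.stdev(attendance)
# Deviation of each value from the overall mean, in numerical form.
deviations = [round(a - mean, 4) for a in attendance]

print(f"mean={mean:.4f}, stdev={stdev:.4f}")
print("deviations:", deviations)
```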
Modeling and Algorithms
Today many machine learning algorithms are employed to predict useful information from raw data. For example, neural networks can be used to identify the users who are willing to donate funds to orphans based on the users’ previous behavior. In this scenario, the previous behavior data of users can be collected based on their education, activities, occupation, sex, etc. The neural network is then trained with this collected data. Whenever a new user’s data is fed to this model, it can predict whether the new user will donate or not. However, the accuracy of the prediction depends on the reliability and the amount of data used for training.
There are many machine learning algorithms available such as regression techniques, support vector machine (SVM), neural networks, deep neural networks, recurrent neural networks, etc., that can be applied to data modeling. After data modeling, the model can be analyzed by giving data from new users and developing a prediction report.
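The train-then-predict workflow can be sketched with a single perceptron, the simplest neural unit (this is a toy stand-in, not the book's model, and the feature vectors and labels are entirely invented): each hypothetical user is described by two features, and the label records whether they donated before.

```python
# Toy features: [education_level, activity_score]; label 1 = donated.
def train_perceptron(samples, labels, epochs=20, lr=0.1):
    """Train a single perceptron with the classic update rule."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
            err = y - pred
            w = [wi + lr * err * xi for wi, xi in zip(w, x)]
            b += lr * err
    return w, b

def predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0

samples = [[3, 0.9], [1, 0.2], [2, 0.8], [0, 0.1]]
labels = [1, 0, 1, 0]
w, b = train_perceptron(samples, labels)

print(predict(w, b, [3, 0.7]))  # -> 1: the new user is predicted to donate
```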
Report Generation/Decision-Making
Finally, a report can be developed based on the analysis with the help of visualization tools, mathematical or statistical techniques, and models. Such reports can be helpful in many circumstances, such as forecasting the strengths and weaknesses of an organization, industry, government, etc. The facts and findings in the report can make decision-making considerably easier and better informed. Moreover, the analysis report can be generated automatically using automation tools based on the client's requirements.
Recent Trends in Data Science
Certain fields in data science are growing rapidly and therefore will be attractive to data scientists. They are discussed in the following sections.
Automation in Data Science
In the current scenario, data science still needs a lot of manual work, such as data processing, data cleaning, and data transformation. These steps consume a lot of time and computation. The modern world demands the automation of data science processes such as data processing, data cleaning, data transformation, analysis, visualization, and report generation. Hence, automation will be in top demand in the data science industry.
Artificial Intelligence–Based Data Analyst
Artificial intelligence techniques and machine learning algorithms can be implemented effectively for modeling the data. Particularly, reinforcement learning with deep neural networks is used to upgrade the learning of the model based on variations in the data. Also, machine learning techniques can be used for automated data science projects.
Cloud Computing
The amount of data used by people nowadays has increased exponentially. Some industries gather a large amount of data every day and hence find it difficult to store and analyze it with local servers, which is expensive in terms of computation and maintenance. So, they prefer cloud computing, in which the data can be stored on cloud servers and retrieved anytime and anywhere for analysis. Many cloud computing companies offer a data analytics platform on their cloud servers. As data processing continues to grow, this field will gain even more attention.
Edge Computing
Many small-scale industries don’t require data analysis on cloud servers and instead need analysis reports instantly. For these kinds of applications, edge devices can acquire the data, analyze it, and instantly present a report to the users in visual or numerical form. In the future, the demand for edge computing will increase significantly.
Natural Language Processing
Natural language processing (NLP) can be used to extract unstructured data from websites, emails, servers, log files, etc. In addition, NLP can be useful for converting text into a single data format. For example, we can convert people's emotions into a data format from their messages on social media. This will be a powerful tool for collecting data from many sources, and its demand will continue to increase.
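The emotion-to-data idea can be sketched with a toy lexicon-based scorer (the word lists below are invented for illustration and far simpler than a real NLP pipeline): count positive and negative words in a message and map the score to a category.

```python
# Toy sentiment lexicons -- illustrative only, not a real NLP resource.
POSITIVE = {"good", "great", "support", "happy", "excellent"}
NEGATIVE = {"bad", "poor", "against", "angry", "terrible"}

def sentiment(message):
    """Convert a free-text message into a categorical data value."""
    words = message.lower().split()
    score = (sum(w in POSITIVE for w in words)
             - sum(w in NEGATIVE for w in words))
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

print(sentiment("This policy is great and I support it"))  # positive
print(sentiment("terrible policy and I am angry"))          # negative
```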
Why Data Science on the Raspberry Pi?
Many books explain the different processes involved in data science in relation to cloud computing. In this book, however, the concepts of data science are discussed as part of real-time applications using the Raspberry Pi. Raspberry Pi boards can interact with the real world by connecting to a wide range of sensors through their general-purpose input/output (GPIO) pins, which makes it easy to collect real-time data. Owing to their small size and low cost, a number of Raspberry Pi boards can be connected as a network, thereby enabling localized operation. In other words, the Raspberry Pi can be used as an edge computing device for data processing and storage, close to the devices used for acquiring the information, thereby overcoming the disadvantages associated with cloud computing. Therefore, many data processing applications can be implemented using a distributed network of these devices that manages real-time data and runs the analytics locally. This book will help you to implement real-time data science applications using the Raspberry Pi.
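The edge pattern described above can be sketched in a few lines: collect readings on the device, then analyze them locally instead of shipping raw data to a cloud server. Here `read_sensor()` merely simulates a GPIO-attached sensor with random values; on a real Raspberry Pi it would be replaced by an actual sensor read.

```python
import random
import statistics

def read_sensor():
    """Simulated sensor: e.g., a temperature in degrees Celsius.

    Stand-in for a real GPIO sensor read on a Raspberry Pi.
    """
    return 20.0 + random.uniform(-2.0, 2.0)

# Acquire a batch of readings on the edge device.
readings = [read_sensor() for _ in range(50)]

# Analyze locally and produce a compact report.
report = {
    "count": len(readings),
    "mean": round(statistics.mean(readings), 2),
    "min": round(min(readings), 2),
    "max": round(max(readings), 2),
}
print(report)
```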