Menal Dahiya, Nikita Malik* and Sakshi Rana
Dept. of Computer Applications, Maharaja Surajmal Institute, Janakpuri, New Delhi, India
Fundamentally, data wrangling is an elaborate process of transforming, enriching, and mapping data from one raw form into another in order to make it more valuable for analysis and to enhance its quality. It is considered a core task within every action performed in the workflow framework of data projects. Wrangling of data begins with accessing the data, followed by transforming it and profiling the transformed data. These wrangling tasks differ according to the types of transformations used. Sometimes, data wrangling can resemble traditional extraction, transformation, and loading (ETL) processes. This chapter describes the various kinds of data wrangling and how data wrangling actions differ across the workflow. The dynamics of data wrangling and the core transformation and profiling tasks are also explored. This is followed by a case study based on a forest-fire dataset, modified using Excel or the Python language, in which the desired transformations and profiling are performed and statistical and visualization analyses are presented.
Keywords: Data wrangling, workflow framework, data transformation, profiling, core profiling
Data wrangling, also known as data munging, is a term that involves mapping the data fields in a dataset from the source (its original raw form) to a destination (a more digestible format). Basically, it consists of a variety of tasks involved in preparing the data for further analysis. The methods applied for wrangling the data depend entirely on the data being worked on and the goal to be achieved, and they may differ from project to project. A data wrangling example could be targeting a field, row, or column in a dataset and applying an action such as cleaning, joining, consolidating, parsing, or filtering to generate the required output. It can be a manual or machine-driven process. In cases where datasets are exceptionally big, automated data cleaning is required.
The term data wrangling is defined as the process of preparing the data for analysis, with data visualization aids that accelerate the process [1]. If the data is accurately wrangled, it ensures that quality data enters the analytics process, and data wrangling thus leads to effective decision making. Sometimes, for making any kind of required manipulation in the data infrastructure, it is necessary to have appropriate permission.
During the past 20 years, the processing of data and the sophistication of tools have progressed, which makes it more necessary to determine a common set of techniques. The increased availability of data (both structured and unstructured) and the sheer volume of it that can be stored and analyzed have changed the possibilities for data analysis—many difficult questions are now easier to answer, and some previously impossible ones are within reach [2]. There is a need for glue that helps to tie together the various parts of the data ecosystem, from JSON APIs (JavaScript Object Notation Application Programming Interfaces) to filtering and cleaning data to creating understandable charts. In addition, classic data quality criteria such as accuracy, completeness, correctness, reliability, consistency, timeliness, precision, and conciseness are also an important aspect [3].
Some tasks of data wrangling include:
This section presents a framework that shows how to work with data. As one moves through the process of accessing, transforming, and using the data, there are certain common sequences of actions that are performed. The goal is to cover each of these processes. Data wrangling also constitutes a promising direction for visual analytics research, as it requires combining automated techniques (for example, discrepancy detection, entity resolution, and semantic data type inference) with interactive visual interfaces [4].
Before deriving direct, automated value, it is common practice to derive indirect, human-mediated value from the given data. To get the expected valuable result from an automated system, we need to assess whether the core quality of our data is sufficient. Generating reports and then analyzing them is a good practice for understanding the wider potential of the data. Automated systems can then be designed to use this data.
This is how data projects progress: starting from short-term answering of familiar questions, to long-term analyses that assess the quality and potential applications of a dataset, and finally to designing the systems that will use the dataset in an automated way. Over the course of this process, the data moves through three main stages of data wrangling: raw, refined, and production, as shown in Table 4.1.
Discovery is the first step of data wrangling. Therefore, in the raw stage, the primary goal is to understand the data and to get an overview of it: to discover what kinds of records are in the data, how the record fields are encoded, and how the data relates to your organization, to the kinds of operations you have, and to the other existing data you are using. In short, get familiar with your data.
Table 4.1 Movement of data through various stages.
[Table: primary objectives for each data stage — Raw, Refined, and Production.]
Seeing the trends and patterns helps you conceptualize what kind of analysis you may want to do and, armed with an understanding of the data, you can then refine the data for intense exploration. When first collected, raw data comes in different sizes and shapes and does not have any definite structure. We can remove the parts of the data that are not being used, reshape the elements that are poorly formatted, and establish relationships between multiple datasets. Data cleaning tools are used to remove errors that could negatively influence your downstream analysis.
Once the data to be worked with is properly transformed and cleaned for analysis, and completely understood, it is time to decide whether all the data needed for the task at hand is there. Once the quality of the data and its potential applications in automated systems are understood, the data can be moved to the next stage, that is, the production stage. At this point, the final output is pushed downstream for analytical needs.
Only a minority of data projects end up in the raw or production stages; the majority end up in the refined stage. Projects ending in the refined stage add indirect value by delivering insights and models that drive better decisions. In some cases, these projects might last multiple years [2].
There are mainly three actions that we perform in the raw data stage, as shown in Figure 4.1.
Data ingestion is the movement of data from varied sources to a storage medium, such as a data warehouse, data mart, or database, where it can be retrieved, utilized, and analyzed. This is the key step for analytics. Because data sources span a wide spectrum, the ingestion process can be complex in some areas. At the less complex end, many people get their data as files through channels like FTP sites and email. At the more complex end are modern open-source tools, which permit more granular and real-time transfer of data. Between these two ends of the spectrum are proprietary platforms such as Informatica Cloud and Talend, which support a variety of data transfers and are easy to maintain even for people who do not come from technical areas.
In traditional enterprise data warehouses, some initial data transformation operations are involved in the ingestion process. After the transformation, when the data fully matches the syntaxes defined by the warehouse, it is stored in predefined locations. In some cases, new data has to be added to the previous data. This process of appending newly arrived data can be complex if the new data contains edits to the previous data; this leads to ingesting the new data into separate locations, where certain rules can be applied for merging during the refining process. In other cases, it can be simple, where new records are just appended at the end of the prior records.
This stage occurs when the data you are ingesting is unknown. In this case, you do not know how to work with your data or what results to expect from it. This leads to actions related to the creation of metadata. One action is creating generic metadata, which focuses on understanding the characteristics of your data. The other action is making a determination about the data's value by using those characteristics; in this action, custom metadata is created. A dataset contains records and fields, that is, rows and columns. You should focus on understanding the following things while describing your data:
Based on the potential of the present data, it is sometimes required to create custom metadata in the discovery process. Generic metadata is useful for knowing how to properly work with the dataset, whereas custom metadata is required to perform specific analyses.
After the ingestion and complete understanding of the raw data, the next essential step is the refining of the data and exploring it through analyses. Figure 4.2 shows the actions performed in this stage. The primary actions involved in this stage are:
The all-embracing motive in designing and creating the refined data is to simplify the anticipated analyses that have to be performed. As we cannot foresee all of the analyses that will be performed, we look at the patterns derived from the initial analyses, draw insights, and get inspired by them to create new analysis directions that we had not considered previously. After refining the datasets, we compile them or modify them. Very often, it is required to repeat the actions in the refining stage.
In this stage, the data is transformed the most, in the process of designing and preparing the refined data. If, while creating the metadata in the raw stage, there were any errors in the dataset's accuracy, time, granularity, structure, or scope, those issues must be resolved during this stage.
After refining the data, we reach a stage where we start getting valuable insights from the dataset; it is then time to separate the analyses (Figure 4.3). By separating, it means that you will be able to detect which analyses have to be done on a regular basis and which ones were sufficient as one-time analyses.
Data wrangling is a core iterative process that turns out the cleanest, most useful data possible before you start your actual analysis [5]. Transformation is one of the core actions involved in data wrangling; the other is profiling, and we need to iterate quickly between these two actions. We will now explore the transformation tasks present in the process of data wrangling.
These are the core transformation actions that we need to apply to the data:
These are the actions used to change the schema and form of the data. Structuring mainly involves shifting records around and organizing the data. It can be a very simple kind of transformation; sometimes it is just changing the order of columns within a table. It also includes summarizing record field values. In some cases, it is required to break record fields into subcomponents or to combine fields together, which results in a complex transformation.
The most complex kind of transformation is inter-record structuring, which includes aggregations and pivots of the data (a brief pandas sketch follows the list below):
Aggregation—It allows switching the granularity of the dataset, for example, switching from individual persons to segments of persons.
Pivoting—It includes shifting entries (records) into columns (fields) and vice-versa.
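As an illustration, the following minimal pandas sketch shows both kinds of inter-record structuring; the DataFrame and its columns are hypothetical stand-ins, not taken from the case study.

import pandas as pd

# Hypothetical record-level data: one row per state per month
df = pd.DataFrame({
    "State": ["Acre", "Acre", "Bahia", "Bahia"],
    "Month": ["Jan", "Feb", "Jan", "Feb"],
    "Fires": [10, 15, 40, 35],
})

# Aggregation: switch granularity from (state, month) to state
per_state = df.groupby("State", as_index=False)["Fires"].sum()

# Pivoting: shift month entries into columns, and back again with melt
wide = df.pivot(index="State", columns="Month", values="Fires")
long_again = wide.reset_index().melt(id_vars="State", value_name="Fires")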
These are the actions used to add fundamentally new records from multiple datasets to the dataset and to strategize about how this additional data might enrich it. The typical enriching transformations are:
Besides joins and unions, another common task is the insertion of metadata and the computation of new data entries from the existing data in your dataset, which results in the generation of generic metadata. This inserted metadata can be of two types:
These are the actions used to resolve errors and fix any irregularities present in your dataset. Cleaning fixes quality and consistency issues and makes the dataset clean. High data quality is not just desirable but one of the main criteria that determine whether a project is successful and the resulting information correct [6]. It basically includes manipulating single column values within the rows. The most common type is fixing missing or NULL values in the dataset and implementing formatting, hence increasing the quality of the data.
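A minimal pandas sketch of such cleaning actions is given below; the column names and the chosen fill values are illustrative assumptions.

import pandas as pd

# Hypothetical dirty data with a NULL count and inconsistent formatting
df = pd.DataFrame({
    "State": [" acre", "Bahia", None],
    "Number of Fires": [10, None, 25],
})

# Fix missing/NULL values in a single column
df["Number of Fires"] = df["Number of Fires"].fillna(0)

# Apply consistent formatting to the string values
df["State"] = df["State"].fillna("Unknown").str.strip().str.title()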
In order to understand your data before you start transforming or analyzing it, the first step is profiling. Profiling guides data transformations and helps in reviewing the source data for content and quality [7].
One challenge of data wrangling is that reformatting and validating data require transforms that can be difficult to specify and evaluate. For instance, splitting data into meaningful records and attributes often involves regular expressions that are error-prone and tedious to interpret [8, 9].
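As a hedged illustration of such a transform, the snippet below splits a made-up "YYYY-MM-State" string into separate attributes with a regular expression; the data layout is invented purely for the example.

import pandas as pd

# Invented raw strings that bundle several attributes into one field
raw = pd.Series(["1998-01-Acre", "1998-02-Bahia"])

# Split each string into year, month, and state attributes.
# A pattern that is too strict or too loose silently drops or mangles rows,
# which is why such transforms are error-prone and need profiling afterwards.
parts = raw.str.extract(r"^(?P<year>\d{4})-(?P<month>\d{2})-(?P<state>.+)$")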
Profiling can be divided on the basis of the unit of data it works on. There are two kinds of profiling:
There are two kinds of constraints in individual values profiling. These are:
Syntactic constraints focus on formats; for example, if the date format is MM-DD-YYYY, then date values should appear in this format only.
Semantic constraints are built on context or exclusive business logic; for example, if your company is closed for business on a festival, no transactions should exist on that particular day. This helps determine whether an individual record field value, or an entire record, is valid.
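A minimal sketch of checking both kinds of constraints is shown below; the columns, the MM-DD-YYYY format, and the holiday date are assumptions used only to illustrate the idea.

import pandas as pd

# Hypothetical transaction records
df = pd.DataFrame({
    "Date": ["01-15-2020", "2020/01/16", "12-25-2020"],
    "Amount": [120.0, 80.5, 40.0],
})

# Syntactic constraint: dates must match the MM-DD-YYYY format
syntactically_valid = df["Date"].str.match(r"^\d{2}-\d{2}-\d{4}$")

# Semantic constraint: no transactions should exist on a business holiday
holiday = "12-25-2020"
semantically_valid = df["Date"] != holiday

df["is_valid"] = syntactically_valid & semantically_valid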
This kind of profiling mainly focuses on the shape and distribution of values within a single record field, or on the range of relationships between more than one record field.
For example, there might be higher retail sales on holidays than on non-holidays. Thus, you could build a set-based profile to ensure that sales are distributed across the month as expected.
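A set-based profile of this kind could be sketched as follows; the sales DataFrame, its columns, and the flagging threshold are hypothetical.

import pandas as pd

# Hypothetical daily sales for one month
sales = pd.DataFrame({
    "day_of_month": [1, 2, 3, 24, 25, 26],
    "revenue": [100, 90, 95, 300, 450, 320],
})

# Profile how revenue is distributed across the days of the month
share = sales.groupby("day_of_month")["revenue"].sum()
share = share / share.sum()

# Flag days whose share deviates strongly from the expected distribution
flagged = share[share > 2 * share.mean()]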
To wrangle the data into a dataset that provides meaningful insights and to carry out the cleansing process, it is often necessary to write code in languages such as Perl and R and to edit manually with tools like MS-Excel [10].
The goal is to perform the following tasks:
Kandel et al. [11] have discussed a wide range of topics and problems in the field of data wrangling, especially with regard to visualization. For example, graphs and charts can help identify data quality issues, such as missing values.
In the code above, we create a DataFrame named df_fire by loading a CSV file with the Pandas read_csv() function. The full path and name of the file is ‘brazilian-fire-dataset.csv’. The result is shown in Figure 4.6.
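Since the code itself appears only in the figures, a minimal sketch of this loading step is given below; it assumes the CSV file sits in the working directory.

import pandas as pd

# Load the Brazilian forest-fire dataset into a DataFrame
df_fire = pd.read_csv("brazilian-fire-dataset.csv")
print(df_fire)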
Here we can see that the dataset contains 6454 rows and five columns. The column “Number of Fires” has a float datatype.
In the first line of code, we specify the order of the columns. In the second line, we change the datatype of the column “Number of Fires” to integer. Then we rearrange the columns in the dataset and print it. The result is shown in Figures 4.7 and 4.8.
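A sketch of these steps is shown below; apart from "Number of Fires", the column names ("Year", "State", "Month", "Date") are assumptions about the dataset's layout, not taken from the figures.

# Assumed column order; only "Number of Fires" is confirmed in the text
column_order = ["Year", "State", "Month", "Number of Fires", "Date"]

# Cast the fire counts from float to integer (assumes no missing values remain)
df_fire["Number of Fires"] = df_fire["Number of Fires"].astype(int)

# Rearrange the columns and print the result
df_fire = df_fire[column_order]
print(df_fire)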
For displaying the top 10 records of the dataset, the .head() function is used as follows (Figure 4.9).
For displaying the last 10 records of the dataset, we use the .tail() function as follows (Figure 4.10).
To get a statistical summary of the data frame for all the columns, we use the .describe() function. The result is shown in Figure 4.11.
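A sketch of these three inspection calls, continuing with the df_fire DataFrame loaded above:

# First and last 10 records of the dataset
print(df_fire.head(10))
print(df_fire.tail(10))

# Statistical summary covering all columns, not just the numeric ones
print(df_fire.describe(include="all"))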
Here, we first get the maximum number of fires on any given day in the dataset using the .max() function. Then we display the record that has this number of fires. The result is shown in Figure 4.12.
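A sketch of this step, assuming the fire counts live in the "Number of Fires" column:

# Maximum number of fires recorded on any single day
max_fires = df_fire["Number of Fires"].max()

# Display the record(s) carrying this maximum value
print(df_fire[df_fire["Number of Fires"] == max_fires])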
The result is shown in Figure 4.13 below.
For example, Acre-18452, Bahia-44718, and so on. Because of the .head() function, only the top 10 values are visible.
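The grouping itself is shown only in the figures; a plausible sketch, assuming the state names live in a "State" column, is:

# Total number of fires per state ("State" column name is assumed)
fires_by_state = df_fire.groupby("State")["Number of Fires"].sum()

# .head(10) limits the output to the first 10 states
print(fires_by_state.head(10))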
The code is given in Figure 4.15. Here, the plot function from Matplotlib is used.
In Figure 4.16, the line plot depicts the values as a series of data points connected with straight lines.
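Exactly which series is plotted is visible only in the figures; the sketch below, under that caveat, plots the per-state totals computed above with Matplotlib's plot function.

import matplotlib.pyplot as plt

# Line plot: data points joined by straight lines (series choice is assumed)
plt.plot(fires_by_state.index, fires_by_state.values)
plt.xlabel("State")
plt.ylabel("Number of Fires")
plt.xticks(rotation=90)
plt.show()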
To get the total number of fires in each month, we again use the GroupBy and aggregate functions to obtain the monthly fire counts.
After getting the required data, we plot the pie chart as given in Figure 4.18.
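A sketch of the monthly grouping and the pie chart, again assuming a "Month" column:

import matplotlib.pyplot as plt

# Total number of fires per month ("Month" column name is assumed)
fires_by_month = df_fire.groupby("Month")["Number of Fires"].sum()

# Pie chart of each month's share of all fires
plt.pie(fires_by_month.values, labels=fires_by_month.index, autopct="%1.1f%%")
plt.show()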
In Figure 4.18, we can see that the months of July, October, and November have the highest numbers of fires. A pie chart shows percentages of a whole and represents those percentages at a set point in time; pie charts do not show changes over time.
For plotting the bar graph, we have to get the values for the total number of fires in each year (Figure 4.19).
After getting the values of the years and the numbers of fires in descending order, we plot the bar graph using the bar function from Matplotlib (Figure 4.20).
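A sketch of the yearly grouping and the bar graph, assuming a "Year" column:

import matplotlib.pyplot as plt

# Total number of fires per year, sorted in decreasing order ("Year" is assumed)
fires_by_year = (
    df_fire.groupby("Year")["Number of Fires"].sum().sort_values(ascending=False)
)

# Bar graph of the yearly totals
plt.bar(fires_by_year.index.astype(str), fires_by_year.values)
plt.xlabel("Year")
plt.ylabel("Number of Fires")
plt.show()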
In Figure 4.20, it can be observed that the highest number of fires occurred in the year 2003 and the lowest in 1998. The graph shows the number of fires in decreasing order.
With the increasing amount of data and the vast number of diverse data sources providing it, organizations face many issues. They are compelled to use the available data and to produce competitive benefits from it in order to pull through in the long run. For this, data wrangling offers an apt solution, of which data quality is a significant aspect. The actions in data wrangling can further be divided into three parts, which describe how the data progresses through the different stages. Transformation and profiling are the core processes that help us iterate through records, add new values, and detect and eliminate errors. Data wrangling tools also help us discover problems present in the data, such as outliers, if any. Many quality problems can be recognized by inspecting the raw data; others can be detected through diagrams or other kinds of representations. Missing values, for instance, are indicated by gaps in graphs, wherein the type of representation plays a crucial role, as it greatly influences what can be detected.