6 | Big Data Simplified
Consistency implies that only valid data will be written to a database. If, for some reason,
a transaction violates the database’s consistency rules, then the entire transaction should be
rolled back, and the database will be restored to a state that was consistent with those rules.
On the other hand, if a transaction successfully executes as per the consistency rules of the
database, the database moves from one state that is consistent with the rules to another state
that is also consistent with those rules.
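The rollback behaviour described above can be illustrated with a minimal sketch that uses Python's built-in sqlite3 module. The accounts table, the rule that a balance may never be negative, and the transfer and balances helpers are all assumptions made for illustration; they are not taken from any particular system.

```python
import sqlite3


def transfer(conn, src, dst, amount):
    """Move `amount` from src to dst; undo everything if a rule is violated."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst),
            )
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src),
            )
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint failed, so the credit made above is rolled back too.
        return False


def balances(conn):
    """Return the current balances as a {name: balance} dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))


conn = sqlite3.connect(":memory:")
# Hypothetical consistency rule: a balance may never drop below zero.
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER NOT NULL CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)", [("john", 100), ("jane", 50)])
conn.commit()
```

A transfer of 30 from john to jane moves the database from one consistent state to another. A transfer of 500 would violate the rule, so the whole transaction is rolled back and the balances stay as they were.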
Isolation requires that multiple transactions occurring on the database at the same time
should not impact one another. For example, if John carries out a transaction on a database at
the same time as Jane executes a different transaction, then the two should operate in isolation
from each other. The database should behave as if it performed John’s transaction in its entirety before
executing Jane’s, or vice versa. This prevents one transaction from reading intermediate data
produced as a side effect of part of the other transaction, which may never be committed
to the database and would therefore lead to erroneous outcomes. Note that the isolation
property does not mandate which transaction should execute first. It merely requires that
transactions do not interfere with each other.
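One aspect of isolation, namely that a transaction cannot read another transaction's uncommitted data, can be sketched with two connections to the same SQLite database file. The table, the values and the helper names below are assumptions chosen for illustration, and the sketch relies on the default settings of Python's sqlite3 module.

```python
import os
import sqlite3
import tempfile


def read_balance(conn):
    """Read John's balance; fetchall() finishes the statement before returning."""
    rows = conn.execute("SELECT balance FROM accounts WHERE name = 'john'").fetchall()
    return rows[0][0]


def demo():
    """Return what Jane sees before and after John commits his update."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "bank.db")
        john = sqlite3.connect(path)
        jane = sqlite3.connect(path)
        john.execute("CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER)")
        john.execute("INSERT INTO accounts VALUES ('john', 100)")
        john.commit()

        # John's transaction is opened implicitly by the UPDATE and stays uncommitted.
        john.execute("UPDATE accounts SET balance = 40 WHERE name = 'john'")
        seen_during = read_balance(jane)  # Jane reads the last committed value

        john.commit()
        seen_after = read_balance(jane)   # the committed update is now visible

        john.close()
        jane.close()
        return seen_during, seen_after
```

Calling demo() should return (100, 40): Jane never observes the intermediate value written by John's unfinished transaction.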
Durability ensures that any transaction that is committed to the database will not be lost.
This is ensured through the use of database backups and transaction logs that facilitate the
restoration of committed transactions in the event of any software or hardware failures.
Database administrators can use a number of strategies to ensure ACID compliance.
One strategy used to enforce atomicity and durability is write-ahead logging (WAL), in which
the details of any transaction are first written to a log that includes both redo and undo information.
This ensures that, after a software or hardware failure of any sort, the database can check
the log and compare its contents to the current state of the database to determine which changes must be redone or undone.
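The idea of logging redo and undo information before touching the data can be sketched as follows. This is a toy model under stated assumptions: the WalStore class and the recover function are hypothetical names, Python lists and dictionaries stand in for files on disk, transactions are assumed to run one at a time, and None is used to mean that a key did not exist.

```python
class WalStore:
    """Toy key-value store; `log` and `data` stand in for files on disk."""

    def __init__(self):
        self.log = []   # append-only log, assumed to reach disk before the data
        self.data = {}  # the database contents

    def update(self, tx, key, value):
        # Write-ahead rule: record undo (old) and redo (new) information first.
        self.log.append({"op": "update", "tx": tx, "key": key,
                         "old": self.data.get(key), "new": value})
        self.data[key] = value

    def commit(self, tx):
        self.log.append({"op": "commit", "tx": tx})


def recover(log, data):
    """Use the log to bring `data` back to a state with only committed changes."""
    committed = {rec["tx"] for rec in log if rec["op"] == "commit"}
    # Redo pass: reapply updates of committed transactions in log order.
    for rec in log:
        if rec["op"] == "update" and rec["tx"] in committed:
            data[rec["key"]] = rec["new"]
    # Undo pass: revert updates of uncommitted transactions, newest first.
    for rec in reversed(log):
        if rec["op"] == "update" and rec["tx"] not in committed:
            if rec["old"] is None:
                data.pop(rec["key"], None)
            else:
                data[rec["key"]] = rec["old"]
    return data
```

If transaction 1 sets x to 10 and commits, and transaction 2 then changes x and adds y but the system fails before it commits, recover restores a state in which only x = 10 exists, whether the data survived the failure or not.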
Another technique used to address atomicity and durability is shadow paging, in which a
shadow page is created whenever data is to be modified. With this strategy, the updates made by a
query are written to the shadow page rather than to the real data in the database. The database is
modified only when the edit is complete.
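A minimal sketch of shadow paging is given below. It follows the usage in the text, where the shadow is the copy that receives the updates (some textbooks use the names the other way round). The ShadowPagedStore class and its methods are hypothetical, pages are held in memory, and pages left behind by abandoned edits are never reclaimed.

```python
import itertools


class ShadowPagedStore:
    """Toy page store in which edits go to copies until commit."""

    def __init__(self, initial):
        self._ids = itertools.count()
        self._physical = {}  # physical page id -> contents
        self._current = {}   # logical page -> physical page id (live page table)
        for logical, contents in initial.items():
            pid = next(self._ids)
            self._physical[pid] = contents
            self._current[logical] = pid

    def read(self, logical):
        """Read through the live page table only."""
        return self._physical[self._current[logical]]

    def begin(self):
        """Start an edit by copying the page table; the copy is the shadow."""
        return dict(self._current)

    def write(self, shadow, logical, contents):
        """Write to a fresh physical page that only the shadow table points to."""
        pid = next(self._ids)
        self._physical[pid] = contents
        shadow[logical] = pid

    def commit(self, shadow):
        """Make the edit visible by swapping in the shadow table in one step."""
        self._current = shadow
```

Until commit is called, read keeps returning the old contents. An edit that is abandoned simply never has its shadow table installed, so the real data is untouched.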
Finally, there is another strategy called the two-phase commit protocol, which is especially
useful in distributed database systems. This protocol splits the request to modify the data into
two phases, namely a commit-request phase and a commit phase. In the request phase, all nodes
of the DBMS on the network that are affected by the transaction must confirm that they have
received the request and currently have the capacity to complete the transaction. Once confirmation
is received from all relevant nodes, the commit phase is carried out, in which the data is
actually modified.
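The two phases can be sketched as follows. The Participant class and the two_phase_commit function are hypothetical names used for illustration; the sketch runs in a single process and ignores timeouts, network failures and coordinator crashes, all of which a real implementation has to handle.

```python
class Participant:
    """A node that can stage changes and later commit or abort them."""

    def __init__(self, name, can_commit=True):
        self.name = name
        self.can_commit = can_commit  # simulates whether the node has capacity
        self.staged = None
        self.data = {}

    def prepare(self, changes):
        """Commit-request phase: vote yes only if the changes can be applied."""
        if not self.can_commit:
            return False
        self.staged = dict(changes)
        return True

    def commit(self):
        """Commit phase: apply the staged changes."""
        self.data.update(self.staged)
        self.staged = None

    def abort(self):
        """Discard anything staged; safe to call even if nothing was staged."""
        self.staged = None


def two_phase_commit(participants, changes):
    """Coordinator: commit everywhere only if every participant votes yes."""
    # Phase 1: collect votes from all affected nodes.
    if not all(p.prepare(changes) for p in participants):
        for p in participants:
            p.abort()
        return False
    # Phase 2: all nodes confirmed, so the data is actually modified.
    for p in participants:
        p.commit()
    return True
```

If even one participant declines in the first phase, no node modifies its data, which preserves atomicity across the distributed system.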
1.2.2 Unstructured Data
Unstructured data represents around 80% of the data being used today. It mainly includes
text and multimedia content. Examples of unstructured data include word processing
documents, e-mail messages, videos, photos, audio files, presentations and webpages.
Unstructured data is omnipresent. Most individuals and organizations generate large
volumes of it throughout their lives. Unstructured data is produced either by a
machine or by a human being.
M01 Big Data Simplified XXXX 01.indd 6 5/10/2019 9:56:18 AM
A Closer Look at Data | 7
Figure 1.1 shows a picture and textual data as examples of unstructured data.
Some examples of machine-generated unstructured data are as follows.
Satellite images: It includes weather data and images captured by governments through satellite
surveillance. For instance, the data from Google Earth falls in this category.
Photographs and video: It includes security and surveillance videos, and images and videos
uploaded by users on social networking websites.
Radar or sonar data: It includes vehicular, meteorological and oceanographic profiles.
The following list shows a few examples of human-generated unstructured data.
Text internal to a company: It includes documents internal to the company, such as logs, survey
results and e-mails, which represent a large proportion of the textual information handled by
any business on a daily basis.
Social media data: It includes data generated from social media platforms, such as YouTube,
Facebook, Twitter, LinkedIn and Instagram.
FIGURE 1.1 Examples of unstructured data—pictures, texts
[Figure 1.1: a photograph shown alongside a sample block of text (the write-ahead logging and shadow paging passage) as examples of unstructured data.]
Picture copyright: Sourabh Mukherjee
Mobile data: It includes data such as text messages, location information, usage patterns of
mobile apps, etc.
Website content: It includes data that arise from any website delivering unstructured
content.
Figure 1.2 illustrates the different kinds of unstructured data.
Given the enormous volumes of data exchanged in today’s world, the variety of formats, and the high speeds
of creation and consumption, the analysis of unstructured data can be very challenging. However,
it must be noted that unstructured data holds tremendous value for businesses that successfully
leverage it. As discussed in preceding sections, it is relatively simple to extract actionable insights
from structured data. The well-defined schema of a relational database makes it easy to extract
and analyse records. The analysis of unstructured data, by contrast, is considerably more difficult.
If an organization wishes to analyse unstructured data, then it needs specialized tools and skills
to work with that data. The number of tools and methods will vary with the volume and
types of unstructured data being handled.
It will be worthwhile to examine the steps of leveraging unstructured data.
Working with Unstructured Data—Decision on Objectives: The first step towards processing unstructured
data is to decide on the objectives of analysis and the types of data to be analysed. For example,
analysing data from a sensor is very different from analysing textual unstructured data in an email
or from social media. Similarly, checking the contents of an email for compliance issues is a
very different purpose from analysing network traffic data to arrive at technical support metrics.
FIGURE 1.2 Types of unstructured data
[Diagram: unstructured data branches into photos, videos, satellite images, sensor data, e-mail texts, social media data and mobile data.]
Unstructured Data Examples
Analyse customer communications in the social media: Customer conversations on social media are great
inputs for understanding customer preferences, customer sentiments and purchase intent. For instance, a
product manufacturer would love to know the opinion of a customer about a newly launched product or newly
introduced product feature. Similarly, a retail company would be very interested to know which products a
customer intends to purchase in the near future, since that customer is a natural target for discounts or
other purchase motivations.
Clickstream analysis: For companies that deal in products, such as retail or consumer goods companies, a
significant use case is clickstream analysis, which identifies the order and pattern of navigation through
one or more pages in a website by an individual, and derives from it an understanding of the buying patterns
and product affinities of a customer.
Analysis for compliance: Analytics on email communications, records of call centre conversations and
company documents are important parts of legal and regulatory compliance requirements.
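As a small illustration of the clickstream use case listed above, the following sketch counts how often visitors move from one page to another. The event format (visitor, timestamp, page) and the function name page_transitions are assumptions; real clickstream data would first have to be parsed from web server logs or a tracking service.

```python
from collections import Counter


def page_transitions(clicks):
    """Count page-to-page moves from (visitor, timestamp, page) events."""
    by_visitor = {}
    # Order each visitor's clicks by time so that consecutive pages are adjacent.
    for visitor, _timestamp, page in sorted(clicks, key=lambda c: (c[0], c[1])):
        by_visitor.setdefault(visitor, []).append(page)
    counts = Counter()
    for pages in by_visitor.values():
        counts.update(zip(pages, pages[1:]))
    return counts


clicks = [
    ("u1", 1, "home"), ("u1", 2, "shoes"), ("u1", 3, "cart"),
    ("u2", 1, "home"), ("u2", 5, "shoes"),
]
transitions = page_transitions(clicks)
```

Here the most frequent transition is from the home page to the shoes page, the kind of navigation pattern from which product affinities can be inferred.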
Working with Unstructured Data—Choice of Tools:
The next step is to choose the appropriate analytics
tool for the job and there are several choices to start with.
For instance, if there is a single data source to analyse, such as social media postings to drive
segmentation of customers for designing effective marketing campaigns, then the business can
choose social media analytics or sentiment analysis tools. However, if an organization wants to
mine information from textual data, then text analytics tools should be used.
Regardless of the tool used for analysis of unstructured data, the results should be clearly
tabulated and easy to visualize. In addition, in this modern ecosystem, reports should be accessible
on computers and mobile devices through browser-based clients.
Working with Unstructured Data—Planning the Technology Stack: Once the types of tools for analysing
unstructured data are decided upon, the final step is to select suitable technology infrastructure
and platforms. There are several deployment choices available. One may choose a hardware-based
analytics tool. In that case, one needs to buy a storage system with native analytics;
Teradata Aster is one example. Alternatively, one may opt for one’s own storage infrastructure of
choice. That could be a highly distributed architecture using Hadoop (to be covered in greater
detail in subsequent chapters of this book). One can also run a software-only analytics tool like
SPSS or R on the data sources. Irrespective of the technology stack or the deployment
method chosen, there are some key design considerations in dealing with unstructured data.
If the data storage is planned to be on premise, then it will need to scale for massive data
volumes and high performance. If the key objective is the availability of real-time results, then one
should design for high availability. If the business goal is meaningful analysis of historical trends,
then data durability becomes the key design criterion.
The analysis of data to extract actionable insights has always been a key business objective for
any enterprise. In the modern ecosystem, a significant percentage of that information is found in
the form of unstructured data, either on the web or in applications run on-premise.
Businesses that can leverage this asset will increase their effectiveness, making better business
decisions faster. With the reduction in software and hardware costs,
these benefits are no longer restricted to large enterprises. They are available today
to small and large businesses alike that are serious about harvesting business intelligence from their
own data.
Comparing Structured and Unstructured Data: Contrary to popular perception, structured data often
complements unstructured data and they are not always mutually exclusive. For example, there
can be scenarios where structured data records hold unstructured data.
A suitable example is a web form that has several questions, each one with a list of options
in a drop-down menu, but there is also a field, such as one for Remarks, Comments or Detailed
Information, which allows the user to add freeform textual data. In this example, the answers
generated from the pick lists form structured data, whereas the freeform field creates
unstructured data.
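The web form example can be sketched as a record that mixes both kinds of data. The FeedbackForm class, its field names and the two helper functions are hypothetical and serve only to show how differently the two kinds of fields are queried.

```python
from dataclasses import dataclass


@dataclass
class FeedbackForm:
    product: str  # picked from a drop-down list: structured
    rating: int   # picked from a drop-down list: structured
    remarks: str  # freeform text typed by the user: unstructured


def average_rating(forms, product):
    """Structured fields can be filtered and aggregated directly."""
    ratings = [f.rating for f in forms if f.product == product]
    return sum(ratings) / len(ratings)


def remarks_mentioning(forms, word):
    """Freeform text has no schema, so even a simple query is a text search."""
    return [f.remarks for f in forms if word.lower() in f.remarks.lower()]


forms = [
    FeedbackForm("phone", 4, "Battery drains fast"),
    FeedbackForm("phone", 2, "Poor battery life"),
    FeedbackForm("laptop", 5, "Great screen"),
]
```

The rating can be averaged with a one-line query, whereas learning what customers actually say about the battery requires searching, and ideally interpreting, the freeform remarks.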
As for the differences between structured and unstructured data, there are many.
Besides the obvious difference from a storage perspective, where structured data can be stored
in a relational database, and unstructured data outside one, the other key difference between the
two forms of data is the ease of analysis.
There are several mature analytics tools in the market today for structured data. A traditional
data mining tool lets users run simple text searches across
textual unstructured data, but it yields little value when it comes to analysing media,
network data, weblogs, customer interactions and social media data, because such data lacks an orderly
internal structure. Even though there are several unstructured data analytics tools in the
marketplace, no vendor or tool is a clear winner.
| | Structured Data | Unstructured Data |
|---|---|---|
| Key characteristics | Defined data models; mostly textual; easily searched | No defined data model; can be of different formats like text, images, sound, video; difficult to search |
| Resides in | Relational databases; data warehouses | NoSQL databases; data warehouses; on-premise/online applications; data lakes |
| Generated by | Humans; machines | Humans; machines |
(Continued)