Chapter 1

An Introduction to Big Data

John Panneerselvam, Lu Liu,  and Richard Hill

Abstract

Data generation has increased drastically over the past few years, leaving enterprises that deal with data management swimming in an enormous pool of data. Data management has also grown in importance, because extracting significant value out of a huge pile of raw data is of prime importance for enterprises making business decisions. The governance and management of an organization's data involve orchestrating both people and technology in such a way that the data become a valuable asset for both enterprises and society. With the drastic volume of data being generated every day and the growing importance of data management, an understanding of Big Data is a fundamental requirement for those who wish to gain new insight into future challenges. This chapter introduces the concept of Big Data and gives an overview of the types, nature, advantages, and applications of Big Data in today's technological domain.

Keywords

Cloud; Datasets; Dynamic; Processing; Raw; Real time; Sources; Value

What Is Big Data?

Today, roughly half of the world's population interacts with online services. Data are generated at an unprecedented scale from a wide range of sources. The way we view and manipulate the data is also changing, as we find new ways of discovering insights from unstructured data sources. Managing data volume has changed considerably over recent years (Malik, 2013), because we need to cope with demands to deal with terabytes, petabytes, and now even zettabytes. We now need a vision of what the data might be used for in the future, so that we can begin to plan and budget for the likely resources. A few terabytes of data are quickly generated by a commercial business organization, and individuals are starting to accumulate this amount of personal data. Storage capacity has roughly doubled every 14 months over the past 3 decades. Concurrently, the price of data storage has fallen, which has affected the storage strategies that enterprises employ (Kumar et al., 2012): they buy more storage rather than determine what to delete. Because enterprises have started to discover new value in data, they are treating it like a tangible asset (Laney, 2001). This enormous generation of data, along with the adoption of new strategies for dealing with it, has led to the emergence of a new era of data management, commonly referred to as Big Data.
Big Data has a multitude of definitions, with some research suggesting that the term itself is a misnomer (Eaton et al., 2012). Big Data exposes the huge gap between the analytical techniques historically used for data management and those we require now (Barlow, 2013). The size of datasets has always grown over the years, but we are currently adopting improved practices for large-scale processing and storage. Big Data is not only huge in terms of volume; it is also dynamic and takes various forms. On the whole, we have never seen data of this kind before in the history of technology.
Broadly speaking, Big Data can be defined as the emergence of new datasets with massive volume that change at a rapid pace, are very complex, and exceed the reach of the analytical capabilities of commonly used hardware environments and software tools for data management. In short, the volume of data has become too large to handle with conventional tools and methods.
With advances in science, medicine, and business, the sources that generate data increase every day, especially from electronic communications as a result of human activities. Such data are generated from e-mail, radiofrequency identification, mobile communication, social media, health care systems and records, enterprise data such as retail, transport, and utilities, and operational data from sensors and satellites. The data generated from these sources are usually unprocessed (raw) and require various stages of processing for analytics. Generally, some processing converts unstructured data into semi-structured data; if they are processed further, the data are regarded as structured. About 80% of the world’s data are semi-structured or unstructured. Some enterprises largely dealing with Big Data are Facebook, Twitter, Google, and Yahoo, because the bulk of their data are regarded as unstructured. As a consequence, these enterprises were early adopters of Big Data technology.
The Internet of Things (IoT) has increased data generation dramatically, because patterns of usage of IoT devices have changed recently. Even a simple snapshot has become a data generation event. Along with image recognition, today’s technology allows users to take and name a photograph, identify the individuals in the picture, and include the geographical location, time, and date before uploading the photo over the Internet in an instant. This is a quick data generation activity with considerable volume, velocity, and variety.

How Different Is Big Data?

The concept of Big Data is not new to the technological community. It can be seen as the logical extension of existing technology such as storage and access strategies and processing techniques. Storing data is not new, but doing something meaningful (Hofstee et al., 2013) (and quickly) with the stored data is the challenge of Big Data (Gartner, 2011). Big Data analytics has more to do with information technology management than simply dealing with databases. Enterprises used to retrieve historical data and process them to produce a result; Big Data now deals with processing data in real time and producing quick results (Biem et al., 2013). As a result, months, weeks, and days of processing have been reduced to minutes, seconds, and even fractions of seconds. In reality, the concept of Big Data is making things possible that would have been considered impossible not long ago.
Most existing storage strategies followed a knowledge management–based storage approach, using data warehouses (DW). This approach follows a hierarchy flowing from data to information, knowledge, and wisdom, known as the DIKW hierarchy. Elements at each level constitute the building blocks of the succeeding level. This architecture makes access policies more complex, and most existing databases are no longer able to support Big Data. Big Data storage models need more accuracy, and the semi-structured and unstructured nature of Big Data is driving the adoption of storage models that use cross-linked data. Even though related data may be physically located in different parts of the DW, a logical connection remains between them. Typically, we use algorithms to process data on standalone machines and over the Internet. Most of these algorithms are bounded by space and time constraints and may fail to function correctly if pushed beyond those limits. Big Data is processed with algorithms (Gualtieri, 2013) that can function on a logically connected cluster of machines without such restrictive time and space constraints.
Big Data processing is expected to produce results in real time or near–real time, and it is not meaningful to produce results after a prolonged period of processing. For instance, as users search for information using a search engine, the results that are displayed may be interspersed with advertisements. The advertisements will be for products or services that are related to the user’s query. This is an example of the real-time response upon which Big Data solutions are focused.

More on Big Data: Types and Sources

Big Data arises from a wide variety of sources and is categorized based on the nature of the data, their complexity in processing, and the intended analysis to extract a value for a meaningful execution. As a consequence, Big Data is classified as structured data, unstructured data, and semi-structured data.

Structured Data

Most of the data contained in traditional database systems are regarded as structured. These data are particularly suited to further analysis because they are less complex with defined length, semantics, and format. Records have well-defined fields with a high degree of organization (rows and columns), and the data usually possess meaningful codes in a standard form that computers can easily read. Often, data are organized into semantic chunks, and similar chunks with common description are usually grouped together. Structured data can be easily stored in databases and show reduced analytical complexity in searching, retrieving, categorizing, sorting, and analyzing with defined criteria.
Structured data come from both machine- and human-generated sources. Machine-generated datasets, produced without human intervention, include sensor data, Web log data, call center detail records, data from smart meters, and trading systems. Humans interacting with computers generate data such as input data, XML data, click stream data, and traditional enterprise data such as customer information from customer relationship management systems, enterprise resource planning data, general ledger data, financial data, and so on.

Unstructured Data

Conversely, unstructured data lack a predefined data format and do not fit well into the traditional relational database systems. Such data do not follow any rules or recognizable patterns and can be unpredictable. These data are more complex to explore, and their analytical complexity is high in terms of capture, storage, processing, and resolving meaningful queries from them. More than 80% of data generated today are unstructured as a result of recording event data from daily activities.
Unstructured data are also generated by both machine and human sources. Some machine-generated data include image and video files generated from satellite and traffic sensors, geographical data from radars and sonar, and surveillance and security data from closed-circuit television (CCTV) sources. Human-generated data include social media data (e.g., Facebook and Twitter updates) (Murtagh, 2013; Wigan and Clarke, 2012), data from mobile communications, Web sources such as YouTube and Flickr, e-mails, documents, and spreadsheets.

Semi-structured Data

Semi-structured data are a combination of both structured and unstructured data. They still have the data organized in chunks, with similar chunks grouped together. However, the description of the chunks in the same group may not necessarily be the same. Some of the attributes of the data may be defined, and there is often a self-describing data model, but it is not as rigid as structured data. In this sense, semi-structured data can be viewed as a kind of structured data with no rigid relational integration among datasets. The data generated by electronic data interchange sources, e-mail, and XML data can be categorized as semi-structured data.
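The flexible-schema nature described above can be illustrated with a short sketch. The following Python snippet uses purely hypothetical customer records to show how semi-structured records share a rough shape while the attributes of individual records vary, so that any query must tolerate missing fields:

```python
import json

# Three hypothetical customer records: the chunks share a rough shape,
# but the description of each chunk is not necessarily the same.
raw = """
[
  {"id": 1, "name": "Alice", "email": "alice@example.com"},
  {"id": 2, "name": "Bob", "phone": "+44 1234 567890"},
  {"id": 3, "name": "Carol", "email": "carol@example.com",
   "address": {"city": "Derby"}}
]
"""

records = json.loads(raw)

# No rigid schema can be assumed up front; instead, discover which
# fields actually occur across the collection.
fields = sorted({key for record in records for key in record})
print(fields)

# Queries must tolerate missing attributes, unlike rigid relational rows.
emails = [r.get("email", "<none>") for r in records]
print(emails)
```

The same tolerance for absent attributes is what distinguishes querying semi-structured stores from querying relational tables with fixed columns.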

The Five V’s of Big Data

As discussed before, the conversation about Big Data often starts with its volume, velocity, and variety. These characteristics of Big Data—too big, too fast, and too hard—increase the complexity for existing tools and techniques to process it (Courtney, 2012a; Dong and Srivatsava, 2013). The core concept of Big Data theory is to extract significant value out of raw datasets to drive meaningful decision making. Because more and more data are generated every day and the data pile keeps growing, it has also become essential to consider the veracity of the data in Big Data processing, which determines the dependability of the processed value.

Volume

Among the five V’s, volume is the most dominant characteristic of Big Data, pushing new strategies in storing, accessing, and processing Big Data. We live in a society in which almost all of our activities are turning into data generation events, which means that enterprises tend to swim in an enormous pool of data. Data are growing at a rate often likened to Moore’s law, with the volume of data generated doubling approximately every 2 years. The more devices generate data, the more the data pile up in databases. Data volume is now discussed as much in terms of the bandwidth needed to move it as in terms of its scale. The rapid growth of data generation has driven data management from terabytes to petabytes, and it will inevitably move to zettabytes in no time. This exponential generation of data means that the volume of tomorrow’s data will always be higher than what we face today.
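The doubling rate mentioned above implies simple compound growth, which a one-line calculation makes concrete (the starting figure and horizon here are purely illustrative):

```python
# If data volume doubles roughly every 2 years, a store grows by a
# factor of 2**(years / doubling_period). Figures are illustrative only.
def projected_volume(initial_tb, years, doubling_period=2.0):
    return initial_tb * 2 ** (years / doubling_period)

# 1 TB today becomes 32 TB within a decade under this growth model.
print(projected_volume(1.0, 10))
```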
Social media sites such as Facebook and Twitter generate text and image data through uploads in the range of terabytes every day. A report in the Guardian (Murdoch, 2013) says that Facebook and Yahoo carry out analyses on individual pieces of data that would not fit on a laptop or desktop machine. Research by IBM (Pimentel, 2014) has projected that as much as 35 zettabytes of data will be generated in 2020.

Velocity

Velocity refers to the rate at which in-flight, transitory data are generated and must be processed within an acceptable elapsed time. Most data sources generate high-flux streaming data that travel at very high speed, making the analytics more complex. The speed at which data are generated demands ever more acceleration in processing and analyzing them. Storing high-velocity data and processing them later is not in the spirit of Big Data. Real-time processing is defined by the rate at which data arrive at the database and the time scale within which they must be processed. Big Data favors low latency (i.e., short queuing delays) to reduce the lag between capturing the data and making them accessible. For applications such as fraud detection, even a single minute is too late. Big Data analytics is targeted at responding to applications in real time or near–real time by processing the data in parallel as they arrive in the database. The dynamic nature of Big Data means that decisions on currently arriving data influence decisions on succeeding data. Again, the data generated by social media sites are proving to be very high in velocity: Twitter, for instance, handles more than 250 million tweets per day (O’Leary, 2013), and tweets escalate the velocity of data, considerably influencing the tweets that follow.

Variety

The variety of Big Data reflects the heterogeneity of the data with respect to type (structured, semi-structured, and unstructured), representation, and semantic interpretation. Because the community using IoT grows every day, it constitutes a vast variety of sources generating data such as images, audio and video files, texts, and logs. Data generated by these various sources are ever-changing in nature, leaving most of the world’s data in unstructured and semi-structured formats. Data treated as most significant now may turn out not to be significant later, and vice versa.

Veracity

Veracity relates to the uncertainty of data within a data set. As more data are collected, there is a considerable increase in the probability that the data are potentially inaccurate or of poor quality. The trust level of the data is more significant in the processed value, which in turn drives decision making. This veracity determines the accuracy of the processed data in terms of their social or business value and indicates whether Big Data analytics has actually made sense of the processed data. Achieving the desired level of veracity requires robust optimization techniques and fuzzy logic approaches. (For additional challenges to Big Data veracity, see Chapters 17 and 18.)

Value

Value is of vital importance to Big Data analytics, because data lose their meaning without contributing significant value (Mitchell et al., 2012; Schroeck et al., 2012). There is no point in a Big Data solution unless it is aimed at creating social or business value. In fact, the volume, velocity, and variety of Big Data are processed precisely to extract meaningful value out of the raw data. Not all of the data generated are meaningful or significant for decision making; the relevant data may be only a small sample within a huge pile, and it is evident that non-significant data are growing at a tremendous rate relative to significant data. Big Data analytics must nevertheless act on the whole data pile to extract the significant value. The process is similar to mining for scarce resources: huge volumes of raw ore are processed to extract the small quantity of gold that holds the most significant value.

Big Data in the Big World

Importance

There is clear motivation to embrace the adoption of Big Data solutions, because traditional database systems are no longer able to handle the enormous data being generated today (Madden, 2012). There is a need for frameworks and platforms that can effectively handle such massive data volumes, particularly to keep up with innovations in data collection mechanisms via portable digital devices. What we have dealt with so far are still the beginnings; much more is to come. The growing importance of Big Data has pushed enterprises and leading companies to adopt Big Data solutions to progress towards innovation and insight. HP reported in 2013 that nearly 60% of all companies would spend at least 10% of their innovation budget on Big Data that business year (HP, 2013). It also found that more than one in three enterprises had actually failed with a Big Data initiative. Cisco estimates that global IP traffic flowing over the Internet, which stood at 51.2 exabytes per month in 2013, will reach 131.6 exabytes per month by 2018 (Cisco, 2014).

Advantages and Applications

Big Data analytics reduces the processing time of a query and in turn reduces the time spent waiting for solutions. Combining and analyzing the data allows data-driven (directed) decision making, which helps enterprises grow their business. Big Data enables enterprises to take correct, meaningful actions at the right time and in the right place. Handelsbanken, a large bank in northern Europe, has experienced on average a sevenfold reduction in query processing time, using newly developed IBM software (Thomas, 2012) for data analytics to achieve this improvement. Big Data analytics provides a fast, cheap, and rich understanding of the problems facing enterprises.
Real-time data streaming increasingly has a lead role in assisting human living. The KTH Royal Institute of Technology in Sweden analyzed real-time data streams to identify traffic patterns using IBM Big Data components (Thomas, 2012). Real-time traffic data are collected using global positioning system (GPS) devices from a variety of sources such as radars in vehicles, motorways, and weather sensors. Using IBM InfoSphere Streams software, large volumes of real-time streaming data in both structured and unstructured formats are analyzed (Hirzel et al., 2013). The value extracted from the data is used to estimate the traveling time between source and destination points. Advice is then offered regarding alternative routes, which serves to control and maintain city traffic.
As discussed earlier, processing of massive sets of data is becoming more effective and feasible, in just fractions of seconds. The trustworthiness of the data is an important consideration when processing them. Enterprises are concerned about risk and compliance management with respect to data assets, particularly because most information is transmitted over a network connection. With these applications, we enjoy significant benefits such as better delivery of quality, service satisfaction, and increased productivity. The more integrated and complex the data are, the greater the risk they pose to an organization if they are lost or misused. The need to safeguard data is directly reflected in continuity planning and the growth of the organization. With this in mind, about 61% of managers reported that they have plans to secure data with Big Data solutions (Bdaily, 2013). Big Data is big business. Research studies indicate that companies investing more in Big Data are generating greater returns and gaining more advantages than companies without such an investment, and companies leaning towards Big Data solutions are proving to be tough competitors in the industry. People are attracted to online shopping providers such as eBay because they offer wide access and availability, free shipping, tax reductions, and so on. A huge community might concurrently use the same site over the Internet, and Big Data allows service providers to manage such a heavy load without network congestion, bandwidth and server issues, or Internet traffic affecting users’ experience.
Big Data applications have important roles in national security activities such as defense management, disaster recovery management, and financial interventions (Gammerman et al., 2014). Generally, governments consider securing financial interventions to be one of their primary tasks in fighting international crime. Financial transactions across countries involve various levels of processing, and every level may involve different languages, currencies, processing methods, and economic policies. Such a process also includes information exchange through various sources, including voice calls, e-mail, and written communication. As discussed, most of these sources generate data in unstructured formats lacking inherent structure. Big Data solutions facilitate effective processing of such diverse datasets and provide a better understanding of inter- and intra-dependency factors such as customs, human behavior and attitudes, values and beliefs, influence and institutions, social and economic policies, and political ideologies, thus enabling the right decisions at the right time to benefit the right entities. Big Data solutions also allow potential risk calculation, gross domestic product management, and terrorism modeling by delving into historical records along with current data, despite the huge pile of datasets.
Big Data is used in health care applications to improve quality and efficiency in the delivery of health care and reduce health care maintenance expenditure. With real-time streaming capability, Big Data applications are also used to continuously monitor patient records, helping in the early detection and diagnosis of medical conditions. Big Data solutions benefit local education authorities, which involve various levels of digital data processing under several layers of local governments in the United Kingdom. Local education authorities function as a collective organization, and Big Data solutions offer easy provisions in funding management, progress monitoring, human resource management, coordination of admissions, etc. Similarly, we see Big Data applications in a variety of domains such as online education, cyber security, and weather tracking.

Analytical Capabilities of Big Data

Data Visualization

Visualization of large volumes of data can enable data exploration to be performed more efficiently. Such exploration helps in identifying valuable data patterns and anomalies. Analytic capabilities provide a variety of data visualization types, including bubble charts, bar-column representations, heat grids, scatter charts, and geospatial maps such as three-dimensional building models, digital terrain models, and road/rail networks. Modern software interfaces allow the construction of more complicated and sophisticated reports for the user’s consumption.

Greater Risk Intelligence

Big Data allows greater visibility and transparency of the organization’s risk profile. Reactive discovery of current incidents and proactive analysis of emerging issues help enterprises reduce their susceptibility to fraud and internal hackers. To this end, Big Data analysis helps governments strengthen their security policies and law enforcement by instantly identifying suspicious behavior. The sophisticated machines and environmental sensors in enterprises often generate data about their operating and health conditions. Such data are referred to as machine-to-machine (M2M) data and are usually ignored because of their massive volume. Big Data analytics provides the capability to analyze these data and to maintain equipment components in a timely, preventive manner, thus avoiding costly industrial downtime and increasing operational efficiency.

Satisfying Business Needs

The three major requirements of enterprises are operations, risk, and value management. In the beginning, it is difficult for enterprises to identify which data sources have significant value and which data are worth keeping; keeping abundant data that are low in value is of little use. Big Data analytics enhances an organization’s capability for making quick decisions by providing a rapid estimation of the significant value of the data, thus supporting time and value management. With the aid of Big Data solutions, simultaneous management of operations, risk, and data value is now within the reach of more enterprises. Uncovering the hidden value of underrated data thus greatly benefits organizations and governments through deeper data mining, helping them identify the motivation behind activities involving digital transactions and prevent fraud and international crime.

Predictive Modeling and Optimization

Predictive insight refers to the ability of organizations to better understand the risks involved in their actions, driving the process towards the intended outcome and guiding the adoption of design modifications in the processing architecture. Big Data platforms allow the extraction of these predictive insights by running multiple iterations over the relevant data. Massively parallel processing of Big Data allows organizations to develop and run predictive models to optimize results. Predictive models are run in the environment that hosts the relevant data, avoiding the complex process of moving massive data across networks; such a strategy is referred to as in-database analytics. The architecture is optimized by reducing the risks involved with the applications, avoiding infrastructure failure, and designing an appropriate processing model based on the requirements of the data and the required outcome.
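As a toy illustration of predictive modeling run where the data reside, the sketch below fits a trend line to a small, entirely hypothetical series and projects the next value; a real in-database deployment would of course run inside the data platform itself rather than in application code:

```python
# A minimal sketch of predictive modelling: ordinary least squares on a
# toy series. All numbers are illustrative, not real measurements.
def fit_line(xs, ys):
    """Fit y = a*x + b by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    a = cov / var
    b = mean_y - a * mean_x
    return a, b

# Hypothetical monthly data volumes (TB); the model predicts next month.
months = [1, 2, 3, 4, 5]
volumes = [10.0, 12.0, 14.0, 16.0, 18.0]

slope, intercept = fit_line(months, volumes)
prediction = slope * 6 + intercept
print(prediction)  # the fitted trend continues: 20.0
```

Running many such iterations over the hosted data, with progressively refined models, is what the massively parallel platforms described above make practical at scale.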

Streaming Analytics

Operating nodes are usually configured in such a way that they can be reused to handle streaming data. Real-time analysis of streamed data is thus possible with little data loss and at minimum cost. The architecture is capable of processing data streaming from various sources, integrating the necessary data in an instant, and generating outputs with the lowest possible processing time. Queries can be executed in a continuous fashion, enabled by a high degree of parallelism and automatic optimization.
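A continuous query of this kind can be sketched in miniature as a sliding-window aggregation, with a result emitted as soon as each new reading arrives (the readings here are illustrative):

```python
from collections import deque

def sliding_average(stream, window=3):
    """Continuously emit the mean of the last `window` readings
    as each new value arrives, without storing the whole stream."""
    buf = deque(maxlen=window)  # old readings are evicted automatically
    for value in stream:
        buf.append(value)
        yield sum(buf) / len(buf)

# Hypothetical sensor readings arriving one at a time.
readings = [10, 20, 30, 40, 50]
averages = list(sliding_average(readings))
print(averages)
```

Because the generator holds only the current window, the same pattern applies to unbounded streams, which is the essence of the continuous-query execution described above.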

Identifying Business Use Cases

Often, data gain value in novel situations. Once a useful pattern emerges by means of data exploration, the value sources can be identified and further refined to gain quality benefits. These quality benefits are visualized by the analytics and should be in line with the objectives set forth by the organization.

Video and Voice Analytics

Voice and video data are generated mostly in an unstructured format. In addition, the streaming of such data leads to velocity issues as well as high analytical complexity. Big Data platforms can capture such data effectively with the aid of NoSQL databases. In some cases, the raw video footage itself may be of limited interest because of its massive volume, yet the information it carries is significant for forming metadata. Big Data analytics is capable of extracting individual frames from videos and important transcripts from audio files. Voice and video analytics have a vital role in settings such as call centers, telecommunication networks, CCTV surveillance, and so on.

Geospatial Analytics

Geospatial data sources include land drilling, offshore drilling, abandoned wells, and mobile data. Poor governance of such data often has a profound impact on a business’s goals and may cause considerable economic loss. Big Data analytics aids the effective maintenance of geospatial data and contributes towards more effective productivity management. Big Data processing allows intelligent exploitation of hidden information, which in turn facilitates high-resolution image exploitation and advanced geospatial data fusion for effective interpretation. Such advancements in data interpretation also highly benefit antiterrorism activities facilitated by accurate GPS tracking, unmanned aerial vehicles and remotely piloted aircraft videos, and so on.

An Overview of Big Data Solutions

Google BigQuery

Google BigQuery (Google) is a cloud-based SQL platform that offers real-time business insights and can analyze up to several terabytes of data in seconds. Iterative analysis of massive datasets organized into billions of rows is available through a user interface. BigQuery can also be accessed by making calls to the BigQuery REST API, supported by a variety of client libraries such as PHP, Java, and Python.
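As a rough sketch of what a call to the BigQuery REST API involves, the snippet below builds the endpoint URL and JSON request body for the jobs.query method; the project ID, dataset, and SQL are hypothetical, and the authentication step (attaching an OAuth 2.0 bearer token before sending the request) is omitted:

```python
import json

# Hypothetical project; a real call requires a valid Google Cloud project
# and OAuth 2.0 credentials, both omitted from this sketch.
project_id = "example-project"
endpoint = (
    "https://bigquery.googleapis.com/bigquery/v2/"
    f"projects/{project_id}/queries"
)

# Body for the jobs.query method: the SQL statement plus a flag
# selecting standard SQL over BigQuery's legacy dialect.
request_body = {
    "query": "SELECT name, COUNT(*) AS n "
             "FROM `example-project.logs.events` "
             "GROUP BY name ORDER BY n DESC LIMIT 10",
    "useLegacySql": False,
}

payload = json.dumps(request_body)
print(endpoint)
print(payload)
```

A client library such as the Python one mentioned above wraps exactly this kind of HTTP request, so application code rarely constructs the payload by hand.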

IBM InfoSphere BigInsights

This is a software framework from IBM that uses Hadoop to manage large volumes of data of all formats. BigInsights increases operational efficiency by augmenting the data warehouse environments, thereby allowing storage and analysis of the data without affecting the data warehouse (Harriott, 2013). Operational features of the framework include administration, discovery, deployment, provision, and security.

Big Data on Amazon Web Services

Amazon Elastic MapReduce, built around the Hadoop framework, provides an interface to Big Data analytics tools. Amazon’s DynamoDB is a NoSQL database that allows the storage of massive data in the cloud. Amazon Web Services Big Data solutions are targeted at reducing up-front costs for enterprises and may thus be more cost-effective for smaller enterprises.
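The MapReduce model that Hadoop (and services built on it, such as Elastic MapReduce) implements at scale across a cluster can be sketched in miniature as three in-memory phases; the documents here are illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    """Map: emit a (word, 1) pair for every word in a document."""
    return [(word.lower(), 1) for word in document.split()]

def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Reduce: sum the counts collected for each word."""
    return {word: sum(counts) for word, counts in groups.items()}

documents = ["big data is big", "data is valuable"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
word_counts = reduce_phase(shuffle(pairs))
print(word_counts)  # {'big': 2, 'data': 2, 'is': 2, 'valuable': 1}
```

In a real cluster the map tasks run in parallel across nodes, the shuffle moves data over the network, and the reducers run in parallel as well; the division of labor, however, is exactly this.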

Clouds for Big Data

With the adoption of cloud services to increase business agility, Big Data applications will drive enterprises even more quickly toward clouds. Clouds offer services instantly, with rapid provisioning (Ji et al., 2012) available at a moment’s notice. Experiments with Big Data applications are much easier to run in clouds than to host internally in the enterprise. With their wide availability, clouds permit elastic provisioning of the resources required for Big Data processing (Talia, 2013). The “try before you commit” feature of clouds is particularly attractive to enterprises that are constantly trying to gain a competitive advantage in the marketplace. Some Big Data solutions, such as Google BigQuery (see above), are only available as cloud services. Because the parallelization process involves a huge architecture, building such an environment with the required hardware and software tools is typically prohibitive for small enterprises. In such cases, adopting cloud services proves to be the better option, reducing implementation costs to levels that are viable and sustainable (Courtney, 2012b).
Apart from these advantages, there are, of course, drawbacks to processing Big Data in clouds. Pushing massive volumes of data over any network inevitably risks the overall performance of the infrastructure and also reduces availability and Quality of Service (QoS). Optimizing the bandwidth required to analyze the mammoth amount of Big Data remains an open challenge for cloud vendors (Liu, 2013). Often, enterprises want data to be processed in one physical location rather than distributing processing across geographically remote clusters. Enterprises also need to meet the migration costs involved in moving data to the cloud, because at times application deployment can cost three times as much as the base software.

Conclusions

Developments in technology have started to uncover the concept of Big Data, and both the understanding and the reach of Big Data are better than ever before. Big Data has applications in a wide range of technological domains and will change the way that human activities are recorded and analyzed to produce new insights and innovations. Ultimately, this facilitates new capabilities by way of faster responses to more complex data queries and more accurate prediction models. Big Data is particularly suited to applications in which the barriers to progress are fundamentally grounded in datasets that are large in volume, fast moving (velocity), wide ranging in variety, difficult to trust (veracity), and potentially of high value. Big Data is a potential solution for applications that are complex, time-consuming, and energy-consuming with current machine capabilities, leading us to new and innovative ways of dealing with business and science. Big Data allows us to do big business, with big opportunities leading the way to a better quality of life. Extending the deployment of Big Data solutions to every possible type of digital processing, including areas where Big Data currently seems less relevant, will uncover the value of underestimated and undervalued data and help us discover new innovations in technology.

References

Barlow M. Real-time Big Data Analytics: Emerging Architecture. O’Reilly; 2013.

Bdaily. Big Data: Realising Opportunities and Assessing Risk, Bdaily, Business News. Bdaily Ltd; 2013.

Biem A, Feng H, Riabov A, Turaga D. Real-time analysis and management of big time-series data. IBM Journal of Research and Development. 2013;57 8:1–8:12.

Cisco. Cisco Visual Networking Index: Forecast and Methodology, 2013–2018. 2014 (Online). Available from: http://www.cisco.com/c/en/us/solutions/collateral/service-provider/ip-ngn-ip-next-generation-network/white_paper_c11-481360.html.

Courtney M. Big Data analytics: putting the puzzle together. Engineering and Technology Magazine. 2012;7:56–57.

Courtney M. The larging-up of Big Data. IET Engineering and Technology. 2012;7:72–75.

Dong X, Srivastava D. Big Data integration. In: 29th International Conference on Data Engineering (ICDE). Brisbane: IEEE; 2013:1245–1248.

Eaton C, Deroos D, Deutsch T, Lapis G, Zikopoulos P. Understanding Big Data. McGraw Hill; 2012.

Gammerman A, Cole J, Gollins T, Lewis H, Mandoli G, Bennett N, Lynch M, Velasco E, Tsai. Big Data for security and resilience. In: Cole J, ed. Proceedings of Big Data for Security and Resilience: Challenges and Opportunities for the Next Generation of Policy-makers. RUSI and STFC; 2014:1–85.

Gartner. Gartner Says Solving ‘Big Data’ Challenge Involves More than Just Managing Volumes of Data. Gartner; 2011 (Online). Available from: http://www.gartner.com/newsroom/id/1731916.

Google. Google BigQuery: What Is BigQuery? (Online). Available from: https://cloud.google.com/bigquery/what-is-bigquery.

Gualtieri M. The Forrester Wave: Big Data Predictive Analytics Solutions, Q1 2013. Forrester Research; 2013 Available from: http://www.sas.com/resources/asset/Forrester85601-LR8KBD.pdf.

Harriott J. InfoSphere BigInsights, Bringing the Power of Hadoop to the Enterprise. 2013 (Online). Available from: http://www-01.ibm.com/software/data/infosphere/biginsights/.

Hirzel M, Andrade H, Gedik B, Jacques-Silva G, Khandekar R, Kumar V, Mendell M, Nasgaard H, Schneider S, Soule R, Wu R. IBM streams processing language: analyzing Big Data in motion. IBM Journal of Research and Development. 2013;57 7:1–7:11.

Hofstee H, Chen C, Gebara F, Hall K, Herring J, Jamsek D, Li J, Li Y, Shi J, Wong P. Understanding system design for Big Data workloads. IBM Journal of Research and Development. 2013;57 3:1–3:10.

HP. HP Unleashes the Power of Big Data, Expanded Portfolio Accelerates Adoption, Monetization of Information (Online). Las Vegas: HP; 2013 Available from: http://www8.hp.com/us/en/hp-news/press-release.html?id=1416090#.Ud7rcPnI3Qg.

Ji C, Li Y, Qiu W, Awada U, Li K. Big Data processing in cloud computing environments. In: 12th International Symposium on Pervasive Systems, Algorithms and Networks (ISPAN). San Marcos: IEEE; 2012:17–23.

Kumar A, Lee H, Singh R. Efficient and secure cloud storage for handling big data. In: 6th International Conference on New Trends in Information Science and Service Science and Data Mining (ISSDM). Taipei: IEEE; 2012:162–166.

Laney D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. META Group; 2001 Available from: http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf.

Liu H. Big Data drives cloud adoption in Enterprise. Internet Computing. 2013;17:68–71.

Madden S. From databases to Big Data. IEEE Internet Computing. 2012;16:4–6.

Malik P. Governing big data: Principles and practices. IBM Journal of Research and Development. 2013;57 1:1–1:13.

Mitchell I, Loke M, Wilson M, Fuller A. The White Book Big Data. Fujitsu Services Ltd; 2012.

Murdoch J. Critics of Big Data have overlooked the speed factor. The Guardian. Monday May 20, 2013 (Online). Available from: http://www.theguardian.com/news/datablog/2013/may/20/big-data-critics-overlooked-speed-not-size.

Murtagh R. Seize the Data: 5 Ways to Leverage Big Data for Social Media & Search. New York: Incisive Interactive Marketing LLC; 2013.

O’Leary D. Artificial Intelligence and Big Data. IEEE Intelligent Systems. 2013;28:96–99.

Pimentel F. Big Data + Mainframe. Smarter Computing. IBM; 2014.

Schroeck M, Shockley R, Smart J, Romero-Morales D, Tufano P. Analytics: The Real-world Use of Big Data (Online). IBM Institute for Business Value; 2012 Available from: http://www-935.ibm.com/services/us/gbs/thoughtleadership/ibv-big-data-at-work.html.

Talia D. Clouds for scalable Big Data analytics. Computer. 2013;46:98–101.

Thomas R. IBM Big Data Success Stories. IBM; 2012:1–70.

Wigan M, Clarke R. Big Data’s big unintended consequences. IEEE Computer. 2012;46:46–53.
