Chapter 10

Tools and Technologies for the Implementation of Big Data

Richard J. Self and Dave Voorhis

Abstract

This chapter uses the five V’s of Big Data (volume, velocity, variety, veracity, and value) to form the basis for consideration of the current status and issues relating to the introduction of Big Data analysis into organizations. The first three are critical to understanding the implications and consequences of the available choices of techniques, tools, and technologies, and of the choices that need to be made based on the nature of the data sources and their content. All five V’s are invoked to evaluate some of the most critical issues involved in the choices made during the early stages of implementing a Big Data analytics project. Big Data analytics is a comparatively new field; as such, it is important to recognize that some elements are already well along the Gartner hype cycle into productive use. The concept of the planning fallacy, together with the information technology project success reference class data created by the Standish Group, is used to suggest how the success rates of Big Data projects can be improved. International Organization for Standardization 27002 provides a basis for considering critical issues raised by data protection regimes in relation to the sources and locations of data and the processing of Big Data.

Keywords

Analysis; Data protection; Governance; Hype cycle; Implementation; Paradigms; Project success factors; Techniques; Technologies; Tools; Trust; Value; Veracity

Introduction

This chapter provides an overview and critical analysis of the current status and issues likely to affect the development and adoption of Big Data analytics. It covers the conceptual foundations and models of the subject, the varied technologies and tools that are relevant to practitioners, and a range of considerations that affect the effective implementation and delivery of Big Data solutions to organizational strategies and information requirements.
The three V’s (Laney, 2001; Gartner, 2011) present inherent technical challenges to managing and analyzing Big Data:
Volume implies that there is too much data to house or process using commodity hardware and software. “Too much” is deliberately vague. What is considered too much grows over time—to date, the upper bound of what can be stored or managed is always increasing—and depends on the requirements and budget. A Big Data volume for a small, under-funded charity organization may be considered small and trivially manageable by a large, well-funded commercial enterprise with extensive storage capacity.
Velocity implies that new data arrive too quickly to manage or analyze, or that analysis is required too quickly or too often to support with commodity hardware and software. Of course, “too quickly” is as vague as “too much data.”
Variety implies that there is too much variation in data records or too many diverse sources to use conventional commodity software easily. Conventional data analysis software typically assumes data that are consistently formatted or perhaps are already housed in, for example, a structured query language database management system (SQL DBMS).
The presence of any one of the three V’s is sufficient to acquire the label of “Big Data” because “big” refers to sizable complexity or difficulty and not necessarily volume. Conversely, if all three aspects of velocity, volume, and variety of data are manageable with conventional commodity hardware and software, by this definition it is not Big Data. It might be “Large Data” (i.e., it occupies a manageable amount of space but presumably is still vast, for some agreeable definition of “vast”), or it might just be “data” that are rapidly changing and/or for which results are needed in a hurry, but not unfeasibly so, and/or that are diverse but not unmanageably so.
“Commodity hardware and software” is not rigorously defined here. It can be taken to mean anything from off-the-shelf desktop data analysis tools such as Microsoft Excel and Access; to enterprise relational DBMSs such as Oracle Database, Microsoft SQL Server, or IBM DB2; to data analysis and business intelligence suites such as SAS by SAS Software or SAP Business Objects. “Big Data” implies, in general, that such tools are inadequate, and so are the techniques on which they are based. As a result, a number of specialized techniques have emerged to tackle Big Data problems. The remainder of this chapter will describe some of these techniques, followed by some of the tools that implement them.

Techniques

This section is divided into two parts. The first deals primarily with the volume aspect of Big Data: how to represent and store it. The second deals essentially with the velocity aspect, i.e., methods to produce results in an acceptably timely fashion. Both parts mention variety as appropriate.

Representation, Storage, and Data Management

This section highlights a representative, rather than comprehensive, selection of techniques used to represent and store Big Data, focusing on approaches that can manage a volume of data that is considered infeasible with conventional database technologies such as SQL DBMSs. As of late 2013, in-memory or direct-attached storage is generally favored over network-accessible storage (Webster, 2011).

Distributed Databases

Distributed databases are database systems in which data are stored on multiple computers that may be physically co-located in a cluster or geographically distant. Distributed database systems can include massively parallel processing (MPP) databases and data-mining grids. Distributed database systems typically shield the user from the underlying data storage, providing access via a query language instead. The variety aspect of Big Data can make traditional storage approaches challenging, with the result that column stores or key-value stores—rather than SQL-based storage—are commonplace. Alternatives to SQL are commonly known as NoSQL.
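As a minimal illustration of how a distributed key-value store spreads records across machines, the following Python sketch partitions keys by hashing them to one of a set of hypothetical nodes. The node names and records are invented for illustration; production systems typically use consistent hashing and replication so that nodes can join or leave with minimal data movement.

import hashlib

# Hypothetical cluster of storage nodes; a real system would use network addresses.
NODES = ["node-a", "node-b", "node-c"]

def node_for_key(key: str) -> str:
    """Choose a node by hashing the key, so records spread evenly across the cluster."""
    digest = int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)
    return NODES[digest % len(NODES)]

# A toy key-value "database": one dictionary per node.
store = {node: {} for node in NODES}

def put(key, value):
    store[node_for_key(key)][key] = value

def get(key):
    return store[node_for_key(key)].get(key)

put("customer:1001", {"name": "Smith", "country": "UK"})
print(node_for_key("customer:1001"), get("customer:1001"))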

Massively Parallel Processing Databases

Massively parallel processing databases are distributed database systems specifically engineered for parallel data processing. Each server has memory and processing power, which may include both central processing units and graphics processing units (Ectors, 2013), to process data locally (DeWitt and Gray, 1992). All communication is via networking; no disks are shared, in what is termed a “shared nothing” architecture (Stonebraker, 1986). This can help address the velocity and volume issues of Big Data. Database query languages designed specifically for MPP databases may be employed (Chaiken et al., 2008).

Data-Mining Grids

Unlike a cluster, which is a group of closely coupled computers dedicated to high-performance parallel computing, a grid is a collection of computers that are relatively loosely coupled and that may leave or join the grid arbitrarily, but which collectively (typically via software running on the individual machines) support high-performance, massively parallel computing (Foster and Kesselman, 2004). When this computational capacity is dedicated to data analysis, it is considered a data-mining grid and often takes the form of a collection of tools to facilitate data mining using grid infrastructure (Cannatoro and Pugliese, 2004). This can help address the velocity and volume issues of Big Data.

Distributed File Systems

Distributed file systems are approaches to storing files that involve multiple computers to extend storage capacity, speed, or reliability—typically via redundancy (Levy and Silberschatz, 2009). This can help address the velocity and volume issues. Unlike distributed database systems, which typically shield the user from direct access to stored data, distributed file systems are specifically designed to give users and their applications direct access to stored data.

Cloud-Based Databases

Cloud-based database systems are distributed database systems specifically designed to run on cloud infrastructure (Amazon, 2013). They may be SQL or NoSQL based.

Analysis

This section presents a selection of Big Data analysis approaches, organized alphabetically. Unless otherwise noted, the assumption is that the data are already stored in or accessible via one of the approaches described in the previous section. The descriptions below are intended to be brief introductions rather than detailed overviews and the focus is generally on categories of analysis rather than specific techniques. The latter would be too extensive to cover here and would only duplicate existing texts on statistics, machine learning, artificial intelligence, business intelligence, and the like.

A/B Testing

A/B testing is the process of using randomized experiments to determine which of two advertising campaigns or advertising approaches is more effective. For example, an organization running a mail campaign soliciting donations may not be sure which wording is better at generating a response: “Donate now, we are counting on your help!” or “We need your donation!” Using A/B testing, a trial batch of envelopes is sent with the first wording (Group A) and an equivalent batch with the second wording (Group B). A form is included that the recipient must fill out to accompany a donation, which includes some indication—such as an A or B code discreetly printed at the bottom—of which wording was used. It is then a simple matter to determine which wording produces the better response. However, this is not Big Data in and of itself.
The same process is often used to test changes to Web sites to gauge their effectiveness (Ozolins, 2012). Given the number of hits that a popular Web site can receive, such testing can easily generate a significant volume of data. However, this does not necessarily bring it into the Big Data category. Yet, combined with other analyses of user attributes—such as what country the user is coming from, along with demographic data sourced from a user’s account on the given Web site as part of an overall retail analytics strategy—the need to produce results rapidly based on data from a variety of sources including A/B testing may bring it into Big Data territory (Ash, 2008).
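To make the comparison concrete, the following Python sketch applies a standard two-proportion z-test to decide whether one wording genuinely outperforms the other; the mailing and response counts are purely illustrative.

import math

# Illustrative figures only: mailings sent and responses received for each wording.
sent_a, responses_a = 5000, 260   # Group A
sent_b, responses_b = 5000, 205   # Group B

p_a = responses_a / sent_a
p_b = responses_b / sent_b

# Pooled proportion and standard error of the difference between two proportions.
p_pool = (responses_a + responses_b) / (sent_a + sent_b)
se = math.sqrt(p_pool * (1 - p_pool) * (1 / sent_a + 1 / sent_b))

z = (p_a - p_b) / se
print(f"Response rates: A={p_a:.1%}, B={p_b:.1%}, z={z:.2f}")
# |z| > 1.96 corresponds to significance at roughly the 5% level (two-tailed).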

Association Rule Learning

Association rule learning is a method for discovering relations among variables in databases, based on identifying rules (Agrawal et al., 1993). For example, a database of supermarket sales may record sales of carrots, beans, beer, coffee, peas, and potatoes. Analysis of the data using association rule learning may reveal a rule that a customer who buys carrots and beans is also likely to buy potatoes, and another that a customer who buys beer and coffee is also likely to buy peas. These rules may be used to govern marketing activity. For example, the supermarket may choose to locate carrots and beans in the same freezer and position a poster advertising a particular deal on bulk potatoes (which delivers a high margin) above it.
When the volume of data is high—say, generated by visits to a popular Web site—association rule learning may be classified as a Big Data analysis technique.
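A minimal sketch of the underlying calculation, using a handful of invented baskets, computes the support and confidence of a candidate rule such as {carrots, beans} → {potatoes}; production algorithms such as Apriori perform the same calculations over millions of transactions.

# Illustrative transactions only; a real system would read millions of basket records.
baskets = [
    {"carrots", "beans", "potatoes"},
    {"carrots", "beans", "potatoes", "coffee"},
    {"beer", "coffee", "peas"},
    {"carrots", "beans"},
    {"beer", "coffee", "peas", "potatoes"},
]

def support(itemset):
    """Fraction of baskets containing every item in the itemset."""
    return sum(itemset <= b for b in baskets) / len(baskets)

def confidence(antecedent, consequent):
    """Of the baskets containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_lhs, rule_rhs = {"carrots", "beans"}, {"potatoes"}
print("support =", support(rule_lhs | rule_rhs))       # 2/5 = 0.4
print("confidence =", confidence(rule_lhs, rule_rhs))  # 2/3 = 0.67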

Classification

Classification is a general term associated with the identification of categories in a database. It is not a single technique, but a category of techniques serving a variety of purposes in the general area of data mining. It may involve statistical, artificial intelligence, machine learning, or pattern recognition techniques, among others (Fayyad, 1996a). Classification is by no means specific to Big Data, but Big Data analysis frequently employs classification techniques as part of an overall suite of analytical processes.
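As one very small example of the category, the following Python sketch implements a nearest-centroid classifier on invented two-feature records; real Big Data classification would use far richer features and a distributed library, but the principle of assigning a record to the closest learned category is the same.

# A minimal nearest-centroid classifier on made-up, two-feature records.
training = {
    "low_risk":  [(0.1, 0.2), (0.2, 0.1), (0.15, 0.25)],
    "high_risk": [(0.8, 0.9), (0.9, 0.7), (0.85, 0.95)],
}

def centroid(points):
    """Average each feature across the training points for one category."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

centroids = {label: centroid(points) for label, points in training.items()}

def classify(record):
    """Assign the label of the closest centroid (squared Euclidean distance)."""
    return min(centroids,
               key=lambda label: sum((r - c) ** 2 for r, c in zip(record, centroids[label])))

print(classify((0.2, 0.3)))   # expected: low_risk
print(classify((0.7, 0.8)))   # expected: high_risk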

Crowdsourcing

Crowdsourcing is the use of human effort—typically from an online community—to provide analysis or information rather than using automated processes or traditional employees. This differs from traditional outsourcing in that the humans involved typically do not have a formal employment relationship with the organization that is using them. They may, for example, be recruited online or may simply be casual browsers of a particular Web site. This becomes relevant to Big Data when the diversity of information (i.e., variety) or volume of information to be collected only becomes feasible to collect or process via crowdsourcing (Howe, 2006).

Data Mining

Like classification, data mining is not a single technique but a variety of techniques for extracting information from a database and presenting it in a useful fashion. Originally, the term “data mining” implied discovery of unanticipated findings, such as patterns revealed or relationships uncovered, but over time the term has come to refer to any form of information processing, particularly that involving statistics, artificial intelligence, machine learning, or business intelligence and data analysis (Fayyad, 1996b).

Natural Language Processing and Text Analysis

Large volumes or continuous streams of text, such as Twitter feeds (Twitter, 2013), online forum postings, and blogs, have become a popular focus for analysis. Their unconventional variety, at least from a conventional data processing point of view, and size inevitably associate them with Big Data. In general, natural language processing refers to a variety of techniques that rely on automated interpretation of human languages, which (among a variety of purposes) may be used for machine translation of text, virtual online assistants, natural language database queries, or sentiment analysis (see below) (Jurafsky and Martin, 2009).
Text analysis (also known as text mining) is closely related to natural language processing but refers specifically to studying human text for patterns, categories of text, frequencies of words or phrases, or other analysis of text without the express purpose of acting on its meaning (Feldman and Sanger, 2007) (for more details, see Chapters 11–13).
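A minimal text-mining sketch in Python, using a few invented posts, illustrates the simplest form of such analysis: counting word frequencies across a collection of documents.

import re
from collections import Counter

# Illustrative posts only; a production pipeline would stream text from an API or file store.
posts = [
    "Delivery was late again, really disappointed with the service",
    "Great service and fast delivery, very happy",
    "Service was slow but the staff were helpful",
]

def tokens(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())

word_counts = Counter(t for post in posts for t in tokens(post))
print(word_counts.most_common(5))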

Sentiment Analysis

Also known as opinion mining, sentiment analysis combines natural language processing, text processing, and statistical techniques to identify the attitude of a speaker or writer toward a topic (Lipika and Haque, 2008). Of particular interest to large corporations seeking to evaluate and/or manage popular commentary on their products or actions, sentiment analysis has grown in significance along with the popularity of social media. The volume, variety, and velocity of postings on social media—particularly in response to hot topics—bring this firmly into the domain of Big Data.
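The following Python sketch illustrates the simplest, lexicon-based form of sentiment scoring on invented posts; the word lists are deliberately tiny, and real systems use far larger lexicons or trained models and must cope with negation, sarcasm, and slang.

# Minimal lexicon-based sentiment scoring; illustrative only.
POSITIVE = {"great", "happy", "helpful", "fast", "love"}
NEGATIVE = {"late", "disappointed", "slow", "broken", "hate"}

def sentiment(text):
    """Score a post by counting positive and negative lexicon words."""
    words = text.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    return "positive" if score > 0 else "negative" if score < 0 else "neutral"

for post in ["Great service and fast delivery, very happy",
             "Delivery was late again, really disappointed"]:
    print(sentiment(post), "-", post)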

Signal Processing

The volume of data implied by Big Data leads naturally to considering data as a continuous signal rather than discrete units. Processing vast volumes of data as individual transactions may be nearly intractable, but treating the data as a signal permits the use of established data analysis approaches traditionally associated with processing sound, radio, or images. For example, the detection of threats in social networks may be effectively handled by representing the data as graphs and by identifying graph anomalies using signal detection theory (Miller et al., 2011).
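The eigenspace method cited above is beyond the scope of a short example, but the general idea of treating event volumes as a signal can be sketched simply: the following Python fragment, using invented hourly message counts, flags an hour whose traffic lies far outside the variation of the recent baseline.

import statistics

# Illustrative hourly message counts; the final hour is an obvious spike.
hourly_counts = [120, 118, 125, 130, 122, 119, 127, 124, 121, 410]

window = 8  # hours of history used as the baseline "signal"
baseline = hourly_counts[-(window + 1):-1]
mean = statistics.mean(baseline)
sd = statistics.stdev(baseline)

latest = hourly_counts[-1]
z = (latest - mean) / sd
print(f"latest={latest}, baseline mean={mean:.1f}, z-score={z:.1f}")
if z > 3:
    print("Flag: traffic well outside normal variation")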

Visualization

The velocity of Big Data can present particular challenges in terms of analyzing data; traditional approaches of generating tables of figures and a few graphs are too slow, too user-unfriendly, or too demanding of resources to be appropriate. Visualization is the use of graphical summarization to convert significant aspects of data into easily understood pictorial representations (Intel, 2013).
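A minimal sketch of the idea, assuming the widely used matplotlib plotting library is available: two overlaid histograms summarize 20,000 simulated response-time measurements at a glance, where a table of the raw figures would be unreadable.

import random
import matplotlib.pyplot as plt  # assumes matplotlib is installed

# Illustrative data: simulated response times (ms) for two server regions.
region_a = [random.gauss(120, 15) for _ in range(10000)]
region_b = [random.gauss(150, 30) for _ in range(10000)]

# Overlaid histograms convert 20,000 values into an easily understood picture.
plt.hist(region_a, bins=50, alpha=0.5, label="Region A")
plt.hist(region_b, bins=50, alpha=0.5, label="Region B")
plt.xlabel("Response time (ms)")
plt.ylabel("Number of requests")
plt.legend()
plt.show()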

Computational Tools

This section makes no attempt to be comprehensive, but provides a sample of some notable tools (and, in the case of MapReduce, a category of tools) that are particularly recognized in the context of Big Data. The absence of a tool from this section should not be considered a deprecation of its capabilities, nor should the presence of a tool here be considered an endorsement. The tools listed have been chosen at random from a pool identified solely on the basis of being recognizably connected to the field of Big Data.
Notably and intentionally absent are conventional SQL DBMSs, which are appropriate for managing large, structured collections of data, but are generally regarded as unsuitable for the nearly intractable volume and/or variety of Big Data. Furthermore, they are more than adequately described elsewhere (Date, 2003).

Hadoop

Hadoop is open source software designed to support distributed computing (Hadoop, 2013). In particular, it contains two fundamental components that together address the volume and velocity aspects of Big Data:
1. A distributed file system called HDFS that uses multiple machines to store and retrieve large datasets rapidly.
2. A parallel processing facility called MapReduce (see below) that supports parallel processing of large datasets.

MapReduce

MapReduce is an approach to processing large datasets using a parallel, distributed algorithm on a cluster of computers (Shankland, 2008). A MapReduce program consists of two procedures: Map(), which takes the input data and distributes them to nodes in the cluster for processing; and Reduce(), which collects processed data from the nodes and combines them in some way to generate the desired result. A typical production implementation of MapReduce (e.g., Hadoop, which provides MapReduce processing via the identically named MapReduce component) also provides facilities for communications, fault tolerance, data transfer, storage, redundancy, and data management.
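The following single-machine Python sketch of the classic word-count example illustrates the Map()/Reduce() division of labor; it is not the Hadoop API, and in a real cluster the intermediate pairs would be shuffled across nodes rather than simply concatenated in memory.

from collections import defaultdict

documents = ["big data needs big storage", "data storage and data analysis"]

def map_phase(doc):
    """Map(): emit a (word, 1) pair for every word in the input document."""
    return [(word, 1) for word in doc.split()]

def reduce_phase(pairs):
    """Reduce(): combine the counts emitted for each word."""
    totals = defaultdict(int)
    for word, count in pairs:
        totals[word] += count
    return dict(totals)

# In a cluster the mapped pairs would be shuffled between nodes; here we concatenate them.
intermediate = [pair for doc in documents for pair in map_phase(doc)]
print(reduce_phase(intermediate))
# {'big': 2, 'data': 3, 'needs': 1, 'storage': 2, 'and': 1, 'analysis': 1}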

Apache Cassandra

Cassandra is an open source, high-performance, distributed, Java-based DBMS designed specifically to be fault tolerant on cloud infrastructure (Cassandra, 2013). It is considered a NoSQL database system because it does not employ standard SQL as a query language but instead uses an SQL-like query language called CQL. It can integrate with Hadoop. Notably, it does not support joins or sub-queries.
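A brief sketch of what interacting with Cassandra through CQL might look like from Python, assuming the DataStax Python driver (cassandra-driver) is installed and a node is running locally; the keyspace, table, and data are invented for illustration.

from datetime import datetime
from cassandra.cluster import Cluster  # DataStax Python driver, assumed to be installed

cluster = Cluster(["127.0.0.1"])        # assumes a Cassandra node running locally
session = cluster.connect()

# Keyspace and table names are invented for this example.
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS demo
    WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1}
""")
session.execute("""
    CREATE TABLE IF NOT EXISTS demo.events (
        user_id text, event_time timestamp, action text,
        PRIMARY KEY (user_id, event_time)
    )
""")

# CQL looks like SQL, but there are no joins or sub-queries; reads are driven by the primary key.
session.execute(
    "INSERT INTO demo.events (user_id, event_time, action) VALUES (%s, %s, %s)",
    ("user42", datetime(2013, 9, 11, 10, 0), "login"))
for row in session.execute("SELECT * FROM demo.events WHERE user_id = %s", ("user42",)):
    print(row.user_id, row.event_time, row.action)

cluster.shutdown()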
(More details on some of the technical tools and considerations of constructing Big Data systems can be found in Chapters 7 and 9.)

Implementation

The development and introduction of new technologies and systems are an uncertain art, as can be seen from the Standish Group Chaos reports and Ciborra (2000). Project success rates are low and levels of cancellation and failure are high. High expectations are endemic and often unrealistic (Gartner, 2012).
Many factors affect the implementation and uptake of new technologies in organizations. Some of the most important ones that will have an impact on Big Data systems are presented and evaluated in this section, which covers critical questions relating to the following major topic areas:
• Implementation issues for new technologies and systems
• Data sources
• Governance/compliance.
The intent of this section is not to provide specific answers or recommendations; rather it is to sensitize all stakeholders involved in Big Data analytics projects to the areas in which risk assessment and mitigation are required to deliver successful, effective systems that deliver benefits.

Implementation Issues

Ultimately, technologies and systems projects need to be implemented so that business can be conducted. Project teams continually encounter many pitfalls on the road to implementation.

New Technology Introduction and Expectations

The IT industry is subject to cycles of hyperbole and great expectations. Big Data is one of the latest such technologies, which raises the question of whether it is merely a fashion or whether there is a degree of reality behind the expectations. Gartner (2012) provides regular evaluations of the hype cycles of a wide range of technologies, including one for the technologies and tools of Big Data.
The Gartner hype cycle is a plot of the levels of expectation associated with a particular technology, concept, or product over time. The first phase is called the technology trigger phase, during which the levels of expectation of feasibility and benefits rise rapidly, reaching a peak of inflated expectations. Typically, expectations then rapidly descend toward the trough of disillusionment as products fail to deliver the promised benefits. Thereafter, a few products and technologies start to climb the slope of enlightenment as they are seen to deliver some of the hyped benefits and capabilities. Eventually, a very few products reach the final phase, the plateau of productivity, because they prove effective. The Gartner hype cycle reports also assess when a technology might reach this final, productive stage.
Tables 10.1 and 10.2, sourced from Gartner (2012), show that some aspects of Big Data are already in the plateau of productivity.
Thus, it is clear that the field of Big Data passes the test of operational reality, although there are many aspects in which research is required to ensure the feasibility and effectiveness of the tools and techniques before introduction into high-value and high-impact projects.

Project Initiation and Launch

Information technology projects are particularly prone to lack of success. The Standish Group has been researching and publishing on this problem since the original CHAOS report (Standish Group, 1994), which first identified the scale of the problem.
There seem to be various causes for the lack of success of IT-related projects. A key cause was identified by Daniel Kahneman (2011) as the planning fallacy associated with high degrees of over-optimism and confidence by project planners and sponsoring executives. This is often compounded by the sunk-cost fallacy, described by Arkes and Blumer (1985, cited in Kahneman, 2011, p. 253), in which the size of the investment in a failing project prevents the project team from cutting its losses.
Kahneman describes the planning fallacy as a combination of unreasonable optimism about the likely success of the particular project under consideration and a deliberate act of ignoring external evidence about the likely success of the project based on other, similar projects. He provides strong evidence that this is a universal human behavior pattern in both corporate and personal environments, leading to project cost overruns of 40% to 1,000% and time scale overruns of similar magnitude, often compounded by the sunk-cost fallacy. To mitigate the effects of the planning fallacy, Kahneman provides evidence from the transport (Flyvberg, 2006) and construction industries (Remodeling 2002, cited in Kahneman, 2011, p. 250), and from his own teaching and research, that it is critical to obtain an outside view based on the actual achievements of a wide range of similar projects. This is the basis of reference class forecasting, developed by Bent Flyvberg from a comprehensive analysis of transport projects (Flyvberg, 2006).

Table 10.1

Rising to the Peak of Inflated Expectations

Technologies on the Rise | Time to Deliver, in Years | At the Peak of Inflated Expectations | Time to Deliver, in Years
Information valuation | >10 | Dynamic data masking | 5–10
High-performance message infrastructure | 5–10 | Social content | 2–5
Predictive modeling solutions | 2–5 | Claims analytics | 2–5
Internet of things | >10 | Content analytics | 5–10
Search-based data discovery tools | 5–10 | Context-enriched services | 5–10
Video search | 5–10 | Logical data warehouse | 5–10
NoSQL database management systems | 2–5
Social network analysis | 5–10
Advanced fraud detection and analysis technologies | 2–5
Open SCADA (Supervisory Control And Data Acquisition) | 5–10
Complex-event processing | 5–10
Social analytics | 2–5
Semantic web | >10
Cloud-based grid computing | 2–5
Cloud collaboration services | 5–10
Cloud parallel processing | 5–10
Geographic information systems for mapping, visualization, and analytics | 5–10
Database platform as a service | 2–5
In-memory database management systems | 2–5
Activity streams | 2–5
IT service root cause analysis tools | 5–10
Open government data | 2–5

Sourced from Gartner (2012).

A reference class view should encompass a range of metrics comparing actual achievements with the original plans approved at the time of project launch, for as large a collection of projects as possible. This provides a means of baselining the specific project, which can then be modified in the light of any specific project factors that may justify a more optimistic or pessimistic plan.
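A minimal sketch of reference class forecasting in Python: the overrun ratios below are invented placeholders for a real reference class (for example, figures drawn from the Standish Group surveys or an organization's own project history), and they are used to adjust an "inside view" estimate toward what similar projects actually cost.

import statistics

# Illustrative cost-overrun ratios (actual cost / approved budget) for a hypothetical
# reference class of similar projects; real figures would come from survey data or
# the organization's own project history.
overrun_ratios = [1.1, 1.3, 1.4, 1.6, 1.8, 2.0, 2.4, 3.1]

inside_view_estimate = 500_000  # the project team's own budget estimate (illustrative)

median_uplift = statistics.median(overrun_ratios)
pessimistic_uplift = statistics.quantiles(overrun_ratios, n=10)[8]  # ~90th percentile

print(f"Inside view:                 {inside_view_estimate:,.0f}")
print(f"Reference class (median):    {inside_view_estimate * median_uplift:,.0f}")
print(f"Reference class (90th pct):  {inside_view_estimate * pessimistic_uplift:,.0f}")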

Table 10.2

Falling from the Peak to Reality

Sliding into the Trough of Disillusionment | Climbing the Slope of Enlightenment | Entering the Plateau of Productivity
Typically 2–5 years delivery | Typically >2 years delivery
Telematics | Intelligent electronic devices (2–5) | Web analytics
In-memory data grids | Supply chain analytics (obsolete) | Column-store DBMS
Web experience analytics | Social media monitors (<2) | Predictive analytics
Cloud computing | Speech recognition (2–5)
Sales analytics (5–10)
MapReduce and alternatives
Database software as a service
In-memory analytics
Text analytics

Sourced from Gartner (2012).

Information Technology Project Reference Class

The Standish Group research identified that in the 19 years since its first report in 1994, the level of successful IT-related projects has averaged approximately 30% of all surveyed projects (on time, to budget, and delivering all specified functionality in relation to the launch contract). This is approximately half the rate of successful projects (defined as meeting their original business goals) in an international survey by the Project Management Institute across all business sectors and types of projects (ESI Int, 2013). Current research has not clearly identified why IT-related projects are so much less successful as a class than other projects.
Whereas the Standish Group reports provided some approaches that are claimed to improve the probabilities of success and reduce the likelihood of outright failure, the overall rates of project success (narrowly defined as on time, to budget, and delivering the full agreed functionality) have not improved significantly since 2002. It is of significant concern that the proportion of failed projects (projects that never achieve implementation) has steadily increased since 2002. Challenged projects are defined as those that are implemented in part, overrun time and cost budgets, and fail to deliver the agreed functionality (i.e., fail to meet the quality target). The definitions are provided in the original 1994 Standish Group report.
A key finding was that projects with a value of over $10 million have a 0% probability of being delivered on time, to budget, and with all contracted functionality, whereas small projects with a total budget of less than $750,000 have a relatively high probability of success, at 55% (Standish Group, 1999). This set of data provides salutary evidence of the continued existence of the planning fallacy among planners and sponsors of IT-related projects.
Those involved in the development and implementation of Big Data analytics projects are therefore strongly encouraged to use these data as a base-level reference class from which to develop their project planning estimates. Continued research is also required to collect the reference class data for Big Data analytics projects to provide a more refined base-level forecast for the future.

Mitigating Factors

The Standish Group also collected data on the top factors reported by chief information officers (CIOs) who completed the Chaos surveys as their assessment of the contributory causes for project success (see Table 10.3).
User involvement has remained a critical factor in the success of projects through the years, as has the importance of the active support of company executives. In addition, it is becoming clear that the agile approach is a factor in successful projects. Owing to the novelty of the field of Big Data analytics, user involvement and agile, exploratory approaches will be necessary to be able to deliver value to the business.

User Factors and Change Management

It is clear from Table 10.3 that users have always been highly important in contributing to the success of a project. This is reinforced by almost all textbooks and commentaries on the subject of introducing changes to the working environment and new business practices. Burnes (2009) provides a particularly good insight into the critical factors for involving users in successfully introducing new systems and technology. Because the field of Big Data analytics will involve significant development of new ways of gathering data and then analyzing and presenting them in ways that provide valuable business insights, it will be especially important for the development team to work closely with users and management to facilitate this process. It will be vital to identify the change champions (Burnes, 2009) from among the users: people who have a deep understanding of the business domain, the corporate strategy, working practices, and internal culture, together with a good understanding of the capabilities and limitations of information technology. Change champions may not always be senior, but they will have highly developed internal networks and will be highly trusted by their colleagues.

Table 10.3

Standish Group Success Factors

1994 | 1999 | 2001 | 2004 | 2010, 2012
1. User involvement | 1. User involvement | 1. Executive management support | 1. User involvement | 1. Executive support
2. Executive management support | 2. Executive management support | 2. User involvement | 2. Executive management support | 2. User involvement
3. Clear statement of requirements | 3. Smaller project milestones | 3. Competent staff | 3. Smaller project milestones | 3. Clear business objectives
4. Proper planning | 4. Competent staff | 4. Smaller project milestones | 4. Hard-working, focused staff | 4. Emotional maturity
5. Realistic expectations | 5. Ownership | 5. Clear vision and objectives | 5. Clear vision and objectives | 5. Optimizing scope
6. Smaller project milestones | | | | 6. Agile process
7. Competent staff | | | | 7. Project management expertise
8. Ownership | | | | 8. Skilled resources
9. Clear vision and objectives | | | | 9. Execution
10. Hard-working, focused staff | | | | 10. Tools and infrastructure

Change champions will need to be introduced to the concepts and objectives of Big Data analytics at an early stage in the project so that they can appreciate the potential of Big Data analytics for the business and can identify critical business insights that can be developed from the data. Their active involvement will help to sell the project to all involved and, as important, to identify areas of change that will need particularly sensitive treatment in the overall change management program.

Data Sources and Analytics

The data for Big Data analytics can be sourced from a wide variety of environments; however, the sources can generally be categorized into public sourcing (cloud/crowd) and internal corporate data collection. Different consequences follow from each of these main sources of Big Data.

Cloud/Crowd Sourcing

Much of the data involved in Big Data analytics are sourced from social media such as Twitter, Facebook, and YouTube and similar types of systems including text and multimedia (Minelli et al., 2013). These types of sources satisfy all of the V’s of the definition of Big Data. The purpose of analytics in this area is to attempt to obtain an understanding of the public’s thinking and the impact of its perceptions on business strategy, operations, and tactics using all possible sources including both structured and unstructured data. Corporations will often use these sources to analyze customer perceptions of competitor products and services as well as of their own products and services.
One critical issue in using this type of data is how much trust can be placed in the accuracy of the data and, consequently, in the insights derived through the analytics process (this relates to the fourth V of Big Data, i.e., veracity). Strong anecdotal evidence suggests that much of the data provided by the public on Web pages are deliberately falsified, particularly elements relating to identity and demographic data. Often, these data items are particularly important to the intended analytics exercise.

Corporate Systems

Executive systems within large organizations are another source of Big Data, particularly in the financial services and health care fields (Minelli et al., 2013), where the IT systems collect very large volumes of customer data. Aerospace and car manufacturers and their systems suppliers collect large volumes of data relating to the operation and use of their products, such as aircraft and engine performance data. These data are collected by equipment condition monitoring systems to gain insights at fleet level into how the products are used, with the aim of assisting in the development of revised and new products and diagnosing the causes of specific failures. Equipment operators also use analytics to manage the maintenance and overhaul of their fleets, often using predictive analytics in the process. In the financial services arena, Big Data analytics are often involved in risk analysis and management and in fraud detection and prevention.
Even though the data are derived from the operational systems and processes of the relevant organization, it is not possible to fully trust the accuracy of the data provided to the analytics processes. Corporate data are rarely completely accurate or up-to-date.
An endemic problem in much of the financial services world is the failure of organizations to adequately manage the use of end-user type tools (such as spreadsheets and macros), which are the source of key regulatory reporting data streams (at very high velocity and volume) and management data (such as real-time capital risk analysis and exposure) in the investment banking field.
Persistent research and many publications have demonstrated that approximately 95% of all complex spreadsheets contain errors (Deloitte, 2010; Panko, 2008; Panko and Port, 2012). In addition, many organizations then link spreadsheets together in large networks of interdependency (Cluster Seven, 2011), which results in the propagation of erroneous information through the network until it is used both executively and in decision making.
Most organizations implementing enterprise resource planning systems have found it vital to carry out extensive data cleansing exercises before transferring the data from the legacy systems into the new systems. Often, 50–70% of all data in the databases are removed. This should be another warning flag to the creators and users of Big Data analytics projects: The critical question is, “Are my data clean enough to give valid insights?” This clearly presents a challenge in relation to the fourth and fifth V’s of Big Data (veracity and value) (see Chapter 1).
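A first-pass data-quality profile can be sketched very simply; the following Python fragment, using invented customer records, counts missing, implausible, and duplicate values, the kind of evidence needed before deciding whether the data are clean enough to give valid insights.

# Illustrative customer records; a real exercise would profile millions of rows.
records = [
    {"id": 1, "email": "a.smith@example.com", "age": 42},
    {"id": 2, "email": "",                    "age": 37},
    {"id": 3, "email": "j.jones@example.com", "age": 214},   # implausible age
    {"id": 1, "email": "a.smith@example.com", "age": 42},    # duplicate of record 1
]

missing_email = sum(1 for r in records if not r["email"])
implausible_age = sum(1 for r in records if not 0 <= r["age"] <= 120)
duplicates = len(records) - len({(r["id"], r["email"], r["age"]) for r in records})

total = len(records)
print(f"Missing e-mail:    {missing_email}/{total}")
print(f"Implausible age:   {implausible_age}/{total}")
print(f"Duplicate records: {duplicates}/{total}")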

Analytics Philosophy: Analysis or Synthesis

The core science that supports analytics is statistics. Fundamentally, this identifies correlations among factors and data items, revealing patterns and trends in historical data. However, correlation and causation are not equivalent. It is also vital to recognize that there is no guarantee that the future will be the same as the past. The use of correlation-based analytics of data (from whatever source) is an exercise in pattern identification from the past. It is also simplistic in its statement that “these two things happen together.” The more insightful approach is to develop a clearer understanding of which factor causes the response (causal relationships), because this will enable management to better assess the full business model and business case for actions developed as a consequence of the analysis (see also discussions in Chapter 17 on cultural influences on data patterns).
An example of the problem of understanding the past from analysis of the data is seen in the aftermath of the Credit Crunch in 2008–2009, in the extreme difficulties that the economics profession had in understanding what the levers of control had become for the post–credit crunch economic environment. They found that few if any of the economic correlations and models from the past seemed to apply in understanding the influences and interactions of interest rates, inflation, and employment. The world had changed and a new set of relationships needed to be developed, modeled, and understood; the past was not the same as the future. Part of the problem in this example is that many of the different economic models are based on correlation rather than a fundamental understanding of causal relationships (often because of the extreme complexity of the models and the real world).
The attraction of the statistical analytics approach is that it has the feel and appearance of the scientific process: gather data, analyze them according to algorithms, and identify statistically based results that reveal the contributing components of the relationships. It is, however, an exercise in thinking within the box, within a single domain of knowledge.
To obtain truly valuable insights, a different approach is often effective. This is the approach of synthesis: thinking outside the box and connecting two or more domains of knowledge (Buytendijk, 2010). Incorporating the concept of synthesis into the application of Big Data analytics will be a critical challenge for many organizations.
Success in developing the synthesis of concepts and domains to identify the most important insights will be a critical differentiator between organizations that use analytics as a commodity tool and those that are able to generate sustainable strategic advantage (Carr, 2004).

Governance and Compliance

Information governance and compliance is a well-recognized topic for CIOs (Soares, 2012). Big Data is information technology and is therefore, in one sense, just business as usual for CIOs and their teams from the perspective of information governance, using frameworks such as International Organization for Standardization 27001/27002 or Control Objectives for Information and Related Technology to define their information governance strategy. However, specific issues may present significant challenges in meeting data protection requirements in many legal jurisdictions (see Chapters 15 and 16).

Data Protection Requirements and Privacy

Within the European Union (EU), the data protection regime is strict about the collection, storage, protection, and use of data that can be linked to identifiable citizens (Data Protection Directive 95/46/EC), which is currently implemented via national legislation in each EU country. The EU is in the process of revising this directive to recognize the impact of new factors such as globalization, the cloud, and social networks, creating a general data protection regulation to provide a unifying framework for the whole EU. The planned timescale is to introduce it in 2014 and for it to take effect in 2016 (EU, 2012).
The impact of the restrictions in the current regime on the transfer of personal data to third countries is particularly significant in its effect on cloud-based processing, where it may not be easy or even feasible to determine the location of the storage and processing and whether it meets EU requirements. In theory, the United States (US) Safe Harbor framework (US Dept of Commerce, 2013) meets many of the requirements of the EU framework. However, there are still significant differences between the two regimes, especially in the level and type of access to personal data by government agencies.
In principle, the EU position is that general access by government agencies is highly regulated and permissible only in the course of investigating a specific crime, whereas the US position is far more relaxed and operates with relatively few constraints. The decision of the Society for Worldwide Interbank Financial Telecommunication (SWIFT) network to re-engineer its systems in 2009 to separate EU and non-EU data processing, ensuring that financial data relating to EU organizations were not accessible to US agencies (SWIFT, 2008), illustrates the difficulties for Big Data users in a highly interconnected global world.
The disclosures in the Guardian (Greenwald et al., 2013) provided by Edward Snowden identified the levels and types of access to personally identifiable data in many jurisdictions. Whereas many of his disclosures relate to the US, some also relate to apparent issues in the EU. The impact of this, now public domain knowledge, raises significant questions related to an organization’s ability to comply with the European Data Protection Act (DPA) regime.
One of the criteria of the EU regime is the agreement of the data subject to the collection and use of the data (Article 7). This has consequences, not currently clear, for the collection and processing (analytics) of social media sources such as Twitter, and for the use of personal and location data by Google to direct advertisements toward the users of its services (Davis, 2012). The fundamental questions are who actually owns the personal data, who has the right to use it, and for what purpose. The current EU DPA regime is clear that ownership lies with the data subject; in practice, however, many organizations take the ethically suspect position that they can use it to provide added value to their customers and users.

References

Agrawal R, Imieliński T, Swami A. Mining association rules between sets of items in large databases. In: Proceedings of the 1993 ACM SIGMOD International Conference on Management of Data – SIGMOD ‘93. 1993:207.

Amazon. Amazon Relational Database Service (Amazon RDS). 2013. http://aws.amazon.com/rds/ (retrieved 11.09.13.).

Ash T. Landing Page Optimization. Wiley Publishing; 2008.

Burnes B. Managing Change. fifth ed. Financial Times/Prentice Hall; 2009.

Buytendijk F. Dealing with Dilemmas: Where Business Analytics Fall Short. John Wiley & Sons; 2010.

Cannatoro M, Pugliese A. Distributed data mining on grids: services, tools, and applications. IEEE Transactions on Systems, Man, and Cybernetics. 2004;34(6).

Carr N. Does It Matter? Information Technology and the Corrosion of Competitive Advantage. Boston: Harvard Business School Press; 2004.

Cassandra. Welcome to Apache™ Cassandra®. 2013. http://cassandra.apache.org/ (accessed 11.09.13.).

Chaiken R, Jenkins B, Larson P.-Å, Ramsey B, Shakib D, Weaver S, Zhou J. SCOPE: easy and efficient parallel processing of massive data sets. Proceedings of the VLDB Endowment. 2008;1(2):1265–1276.

Ciborra C. From Control to Drift: The Dynamics of Corporate Information Infrastructures. OUP; 2000.

Cluster Seven. Cluster Seven Spider Map. 2011. http://www.clusterseven.com/spreadsheet-links-map/ (accessed 09.09.13.).

Date C.J. An Introduction to Database Systems. eighth ed. Addison Wesley; 2003.

Davis K. Ethics of Big Data: Balancing Risk and Innovation. O’Reilly Media; 2012.

Deloitte. Spreadsheet Management: Not what You Figured. Deloitte LLP; 2010. http://www.deloitte.com/assets/Dcom-UnitedStates/Local%20Assets/Documents/AERS/us_aers_Spreadsheet_eBrochure_070710.pdf (accessed 10.09.13.).

DeWitt D, Gray J. Parallel database systems: the future of high performance database systems. Communications of the ACM. 1992;35(6):85–98.

Ectors M. MapD – Massively Parallel GPU-Based Database. 2013. http://telruptive.com/2013/05/13/mapd-massively-parallel-gpu-based-database/ (retrieved 22.08.13.).

ESI Int. Ticking Time Bomb for Project Management. Raconteur; 2013 (8.07.13.). http://theraconteur.co.uk/why-professional-project-managers-are-worth-their-weight-in-gold/ (accessed 10.07.13.).

EU. Proposal for the EU General Data Protection Regulation. European Commission.; January 25, 2012. http://ec.europa.eu/justice/data-protection/document/review2012/com_2012_11_en.pdf (accessed 09.09.13.).

Fayyad U, Piatetsky-Shapiro G, Smyth P. From data mining to knowledge discovery in databases. AI Magazine. 1996;17(3):37–54.

Fayyad U, Piatetsky-Shapiro G, Smyth P, Uthurusamy R. Advances in Knowledge Discovery and Data Mining. AAAI Press; 1996.

Feldman R, Sanger J. The Text Mining Handbook. Cambridge University Press; 2007.

Flyvberg B. From Nobel Prize to project management: getting risks right. Project Management Journal. 2006;37:5–15.

Foster I, Kesselman C, eds. The Grid 2. Elsevier Inc; 2004.

Gartner. Gartner Says Solving “Big Data” Challenge Involves More than Just Managing Volumes of Data. 2011. http://www.gartner.com/newsroom/id/1731916 (accessed 20.08.13.).

Gartner. Hype Cycle for Big Data. Gartner Research.; 2012. http://www.gartner.com/DisplayDocument?doc_cd=235042 (accessed 16.07.13.).

Greenwald G, MacAskill E, Poitras L. Edward Snowden: The Whistleblower behind the NSA Surveillance Revelations. The Guardian; June 10, 2013.

Hadoop. Welcome to Apache Hadoop®. 2013. http://hadoop.apache.org (accessed 11.09.13.).

Howe J. The Rise of Crowdsourcing. Wired, Issue 14.06; 2006.

Intel IT Center. Turning Big Data into Big Insights. White Paper.; 2013. http://www.intel.co.uk/content/dam/www/public/us/en/documents/white-papers/big-data-visualization-turning-big-data-into-big-insights.pdf (retrieved 11.09.13.).

Jurafsky D, Martin J. Speech and Language Processing. Prentice Hall; 2009.

Kahneman D. Thinking, Fast and Slow. Allen Lane; 2011.

Laney D. 3D Data Management: Controlling Data Volume, Velocity, and Variety. Meta Group (Now Gartner).; 2001. http://blogs.gartner.com/doug-laney/files/2012/01/ad949-3D-Data-Management-Controlling-Data-Volume-Velocity-and-Variety.pdf (accessed 02.08.13.).

Levy E, Silberschatz A. Distributed file systems: concepts and examples. ACM Computing Surveys. 2009;22(4):321–374.

Lipika D, Haque S.K.M. Opinion mining from noisy text data. In: Proceedings of the Second Workshop on Analytics for Noisy Unstructured Text Data. 2008:83–90.

Miller B, Beard M, Bliss N. Eigenspace Analysis for Threat Detection in Social Networks, 14th International Conference on Information Fusion. 2011 Chicago, Illinois, USA.

Minelli M, Chambers M, Dhiraj A. Big Data, Big Analytics: Emerging Business Intelligence and Analytic Trends for Today’s Businesses. John Wiley & Sons; 2013.

Ozolins M. Run More Tests, Run the Right Tests. 2012. http://www.webics.com.au/blog/google-adwords/split-testing-guide-for-online-retailers/ (accessed 02.09.13.).

Panko R.R, Port D.N. End user computing: dark matter (and dark energy) of corporate IT. In: Proceedings of the 45th Hawaii International Conference on System Sciences. January 2012 Maui, Hawaii.

Panko R.R. What we know about spreadsheet errors. Journal of End User Computing. 2008;10(2):15–21 Spring 1998 (revised May 2008). http://panko.shidler.hawaii.edu/My%20Publications/Whatknow.htm (accessed 28.04.12.).

Shankland S. Google Spotlights Data Center Inner Workings. 2008. http://news.cnet.com/8301-10784_3-9955184-7.html (accessed 08.09.13.).

Soares S. Big Data Governance: An Emerging Imperative. MC PRESS; 2012.

Standish Group. The Chaos Report. The Standish Group International Inc; 1994.

Standish Group. Chaos: A Recipe for Success. The Standish Group International Inc; 1999.

Stonebraker M. The case for shared nothing. Database Engineering. 1986;9:4–9.

SWIFT. SWIFT: SIBOS Issues. SWIFT; 2008 September 16, 2008. p. 12. http://www.swift.com/sibos2008/sibos_2008_learn_discuss_debate/sibos_issues/Sibos_Issues_20080916.pdf (accessed 09.09.13.).

Twitter, 2013. https://twitter.com (accessed 11.09.13.).

US Dept of Commerce. Safe Harbor. 2013. http://export.gov/safeharbor/index.asp (accessed 09.09.13.).

Webster. Storage Area Networks Need Not Apply. CNET; 2011. http://news.cnet.com/8301-21546_3-20049693-10253464.html (accessed 20.08.13.).
