CHAPTER 4

Big Data Solutions

Data is not information, information is not knowledge, knowledge is not understanding, understanding is not wisdom.

—Clifford Stoll

From the last chapter, we know that Big Data is a bit of a catchall term. So where do we go from here? This chapter provides an overview of the most prevalent Big Data tools and resources at reasonably high levels.56 This chapter will not delve into highly technical details. Don’t expect complicated schematics. This is a book about the business case for Big Data, not an implementation guide for any one application. As discussed in Chapter 1, old tools like relational database management systems (RDBMSs) just can’t efficiently handle Big Data. Different times call for different solutions, and it’s time to get familiar with Hadoop, NoSQL, columnar databases, and other emerging Big Data tools.

Note that this is the closest thing to a technical chapter in the entire book. Here I endeavor to keep things at a relatively high level, not to inundate the reader with needless complexity. Yes, database schemas, nodes, clusters, in-memory databases, data compression, parallel processing, and other technical concepts underpin Big Data. However, here they are intentionally kept to a bare minimum. This isn’t that type of book. The main point of this chapter is that Big Data encompasses a variety of new data sources and types, as well as increased data volumes and velocity. As such, to effectively utilize Big Data, organizations need to deploy new tools, such as the ones found in the following pages. For the most part, traditional row-based databases just can’t handle Big Data very well, a point shared by many industry experts like Rich Murnane, Director of Enterprise Data Operations at iJET International. “Let’s say that an organization hires me to build a data system that needs to receive and store hundreds of millions of sensor data records per day,” Murnane told me. “If I advise the organization to try to use a relational database such as I have in the past, that client should probably run out and get a new consultant.”1


BIG DISCLAIMERS
I’ll try here to preempt those who are probably already frothing at the mouth right about now. I am convinced that some folks have read the first two paragraphs of this chapter and are getting ready to slam me on Twitter, write me a nasty e-mail, or comment on my blog. The gripe: Big Data does not necessarily require new databases and tools because traditional data warehouses, data marts, and databases are evolving.
Without question, most large software vendors are not standing still. They are augmenting their current products and launching new ones. As far as I can tell, their general goal is to allow their clients to better handle the massive amounts of semi-structured and unstructured data available today. For instance, consider Sybase, a company founded in 1984. The company sells products in data management, analytics, and mobility, among others. In May 2010, SAP acquired it, keeping the Sybase brand. (Sybase now describes itself on its site as “an SAP company.”) Despite the acquisition, the innovation at Sybase has continued. It is not simply there to support its legacy products. The company now sells Big Data solutions.2 Also, as we’ll see later in this chapter, the latest version of Microsoft’s SQL Server 2012, its relational database, integrates with and simplifies Hadoop—or at least Microsoft claims as much.3
My point in this chapter is not that the more established tools are objectively inferior to the newer ones across the board. Nor do I mean to imply that the traditional software vendors will never alter or improve their current offerings. As the following pages will show, that’s clearly not the case. Rather, those looking to take advantage of Big Data should understand that mature data warehouses, relational databases like DB2, SQL Server, Oracle, and other solutions are often not, by themselves and as currently constituted, ideal for these purposes. New technologies and solutions meet legitimately new business needs—and deal with data the scale of which we have heretofore never seen. Yes, organizations have available to them more data than ever (read: increased velocity, variety, and volume). To combat these Big Data “problems,” there’s an increasing array of powerful solutions. Today, there’s never been more choice and, without any particular bias, this chapter explores these options.
Now, I am not claiming that the traditional data warehouse no longer serves a valuable purpose and that it will never evolve. I am not that presumptuous, and I am certainly not omniscient. I only wish to alert the reader to new technologies that are arguably much better suited for Big Data than many well-trodden solutions developed fifteen years ago. As a general rule, we just can’t handle new data sources and types with the same legacy tools.
Next, organizations should not throw the baby out with the bath water. I don’t foresee a day in which Big Data solutions can do what customer relationship management (CRM), enterprise resource planning (ERP), and other essential enterprise applications can do. More likely, applications like Hadoop will complement existing data warehouses, datamarts, and ad hoc querying and reporting tools.
Next, these tools are constantly changing—and new ones are being developed as we speak. For instance, it’s entirely possible that Hadoop’s current limitations as of this writing (like the lack of real-time reporting) will cease to be limitations by the time you read this book. Yes, Big Data solutions are evolving that quickly.
Finally, please note that, by design, this chapter covers topics in a manner best described as wide, not deep. Many big books have been written about most of the solutions covered here. As such, this chapter should serve as a primer—a launching pad for more detailed discussions about an organization’s specific Big Data needs. It is not intended to be remotely comprehensive.

With these disclaimers out of the way, let’s move on to the specific Big Data solutions that many organizations are currently using.

PROJECTS, APPLICATIONS, AND PLATFORMS

It’s hard to think of Big Data solutions as applications in the traditional sense. For instance, Microsoft Excel and Outlook seem to better fit the definition of an application. Yes, each can do some pretty amazing things, but to compare them to Big Data software is analogous to saying that the Eiffel Tower is just another building. It just doesn’t seem right.

Irrespective of moniker, though, Big Data doesn’t just happen by itself. Even an individual Big Data technique like A/B testing or sentiment analysis still necessitates some type of service, project, software program, or platform. This section examines some of the more mainstream ones.

Hadoop

Any conversation today about Big Data tools has to start with Apache Hadoop, the large collection of open-source projects that distributes and processes data. Collectively, the Hadoop stack and its different components allow organizations to store and make sense of vast amounts of semi-structured and unstructured data. GigaOM calls Hadoop “the world’s de facto Big Data platform.”4 Today, Yahoo!, Facebook57, LinkedIn, American Airlines, IBM, Twitter, and scores of other companies use Hadoop. Its popularity can be attributed to a number of factors, including these:

  • It can handle many different types and sources of data, including structured and unstructured data, log files, pictures, audio files, communications records, and e-mail.
  • It scales easily and horizontally across multiple commodity servers, and its schema-less design means data need not be modeled before it is stored.
  • It has high fault tolerance.
  • It’s extremely flexible.
  • It’s an open-source project that has spawned its own ecosystem, a community that seeks to improve the product.

At present, there is no one “official” Hadoop stack or standard configuration. As of this writing, Hadoop includes more than a dozen dynamic components or subprojects, many of which are complex to deploy and manage. Installation, configuration, and production deployment at scale are often challenging.5 For a much more comprehensive technical look at Hadoop and some of its components, check out a book like Hadoop: The Definitive Guide by Cloudera engineer Tom White.

Importantly, Hadoop’s origins stem from Google’s MapReduce and the Google file system,58 topics certainly worth exploring here. MapReduce takes a unique approach to processing vast amounts of relatively new data types. (In its first incarnations, Hadoop only performed MapReduce jobs, but those days are officially over.) Without getting too technical here, MapReduce works as follows:

  • It breaks Big Data problems into much more manageable subproblems.
  • It distributes those subproblems to myriad “processing nodes.”
  • It reaggregates the results into more digestible datasets. (A toy sketch of this pattern appears below.)
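To make those three bullets concrete, here is a toy, single-machine sketch of the MapReduce pattern in Python. It is purely illustrative; a real Hadoop job would ship the map and reduce functions out to many nodes and handle the shuffling between them for you.

```python
# A toy, single-machine illustration of the MapReduce pattern described above.
# A real Hadoop job would distribute map_phase and reduce_phase across many
# nodes; here the "distribution" and shuffle are simulated in memory.
from collections import defaultdict

def map_phase(document):
    """Break the problem into tiny subproblems: one (word, 1) pair per word."""
    for word in document.split():
        yield (word.lower(), 1)

def shuffle(mapped_pairs):
    """Group intermediate pairs by key, as Hadoop does between map and reduce."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Reaggregate each group into a single, more digestible result."""
    return key, sum(values)

documents = ["big data is big", "data about data"]
mapped = [pair for doc in documents for pair in map_phase(doc)]
results = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(results)  # {'big': 2, 'data': 3, 'is': 1, 'about': 1}
```

On a production cluster, the same mapper and reducer logic could be submitted through Hadoop Streaming, with HDFS supplying the input files and collecting the output.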

In this sense, Hadoop is not a database per se, at least in the traditional sense of the term.6 Rather, it is a file system. More formally, the Hadoop Distributed File System (HDFS) stores vast amounts of data used by other parts of the Hadoop stack. HDFS works closely with another component: MapReduce, the distributed programming framework designed to run on commodity hardware. “MapReduce doesn’t provide access to real-time data, but that’s changing thanks to newer Hadoop components like HBase,” Scott Kahler tells me. Kahler is the Big Data Architect of Adknowledge, the fourth largest advertiser marketplace.7

HBase, the open-source implementation of Google’s BigTable architecture, is becoming an increasingly key part of the Hadoop stack. Think of it as a distributed and scalable Big Data store. HBase and Impala provide traditional data indexing and in-memory storage, giving users more immediate access to data. (The scale of HBase is enormous: it can support billions of rows and millions of columns, while concurrently ensuring that both write and read performance remain constant.) Many IT executives are asking pointed questions about HBase and its offshoots. The HBase NoSQL database is built on top of HDFS, and it shows what’s possible when Hadoop is freed from the constraints of MapReduce.
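For readers who want a feel for what “a distributed and scalable Big Data store” looks like from the application side, here is a brief, hypothetical sketch using happybase, a third-party Python client that talks to HBase through its Thrift gateway. The host, table, and column family names are illustrative assumptions, not part of any particular installation.

```python
# Hypothetical sketch: writing and reading an HBase row with the third-party
# happybase client, which talks to HBase over its Thrift gateway. The host,
# table name, and column family ("cf") are illustrative assumptions.
import happybase

connection = happybase.Connection("hbase-thrift-host")  # assumed hostname
table = connection.table("page_views")                  # assumed, pre-created table

# Rows are keyed by byte strings; columns live inside column families.
table.put(b"user42|2013-01-15", {b"cf:page": b"/pricing",
                                 b"cf:ms_on_page": b"8400"})

# A read by row key touches only the requested columns -- no full-table scan.
row = table.row(b"user42|2013-01-15", columns=[b"cf:page"])
print(row)  # {b'cf:page': b'/pricing'}

connection.close()
```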

While not entirely mature, Hadoop has certainly evolved from its early days. “It’s an exciting time today for Hadoop and its users. The ability to do near-real-time queries on a massive data storage and processing framework is finally becoming a reality,” continues Kahler. “It has been a godsend and represents a major shift in the Big Data world. We now can worry less about how we handle the data and more about the actual insights that we can drive from it.”

The Hadoop Ecosystem

Near the end of my last book, The Age of the Platform, I examined smaller companies that are embracing platform thinking. Amazon, Apple, Facebook, and Google might run the world’s largest platforms, but by no means are they the only ones. Today, WordPress, Salesforce.com, HubSpot, and scores of other companies are allowing—and even encouraging—others to take their products and services in new and exciting directions. Not surprisingly, the same thing is happening with Hadoop.

Because of its open-source origins and rising popularity, Hadoop continues to evolve, as does its ecosystem. Since its inception, complementary projects have leveraged Hadoop’s core functionality—and extended it in different directions. Hive and Pig are two of the most prominent Hadoop extensions, supported by companies like Cloudera and Talend.8 While Hadoop is enormously useful as presently constituted, its popularity also stems from the fact that many people believe that it will continue to evolve and improve over time. The history of technology teaches us that software programs have limited shelf lives, but Hadoop seems like a solid bet, at least for the foreseeable future.

Of course, true open-source projects do not need the permission of senior management at a private company or the imprimatur of a government agency. Anyone with the technical chops and desire can start or contribute to an open-source application and see where it goes. Projects like WordPress, Linux, and Firefox all benefit from active and knowledgeable communities of users and experts who often volunteer their skills after they come home from their day jobs.

The Hadoop community is a vibrant one full of smart cookies. Many communicate online or see each other at events like Hadoop World,9 a conference whose attendance has quintupled over the past three years. Regardless of where and how they meet, members of the Hadoop community are taking the product in new and exciting directions. Entire companies are building products on top of Hadoop (extending the product’s core or native functionality) and supporting it in the enterprise.

Cloudera

We’ll see later in this chapter how Google makes some of its powerful Big Data tools available for public usage. Still, most of Google’s Big Data software is proprietary and lies behind closed doors. Five years ago, that reality started to irk Google employee Marcel Kornacker, so he decided to do something about it. As Kornacker told Wired magazine, “I wanted to work on something similar to what I had been doing but in a more publicly accessible context.”10

Kornacker left Google and baked bread for two weeks before joining a little company named Cloudera in 2008. Based out of Palo Alto, California, the enterprise software company provides Hadoop-based software, support, services, and training. In short, it helps enterprises become more data driven. Thanks to products like Impala, as of this writing, Cloudera has become the biggest vendor of commercial Hadoop technology.

Kornacker believes that Google “sees its custom data center creations as a competitive advantage that should be guarded from rivals. It builds this software only for itself—although plenty of others use it. By contrast, Cloudera builds software for everyone.”11 Kornacker’s smart, but I don’t fully share his viewpoint. As this chapter will show, Google makes many of its products and services available for others to use, including some of its Big Data solutions.

Hortonworks, MapR, and Splunk

Cloudera may be one of the elephants in the Hadoop room, but it’s by no means the only provider or partner. Other companies recognize the tremendous business opportunity available to them and have moved quickly to position themselves as major Hadoop players. These include Hortonworks, a company “focused on accelerating the development and adoption of Apache Hadoop software.”12 For its part, MapR makes “managing and analyzing Big Data a reality for more business users.”13 It has created a commercial distribution of Hadoop and implements its own proprietary file system. (We’ll see in Chapter 5 that Quantcast did something very similar.)

You may be wondering how any organization can create its own file system based upon Hadoop. After all, no one can legally fork a version of Oracle’s enterprise applications or Microsoft SQL Server. Because Hadoop is open source, this behavior isn’t only legal: it’s encouraged. Plus, unlike traditional ERP and CRM vendors, there is no official sanctioning body for Hadoop certification. Anyone can start a Hadoop service firm or development shop without any type of imprimatur. Such is life in the software world. (Note, however, that Hadoop has a governing body of sorts. Its committers review product patches, new code, and enhancements. Many large-scale open-source projects work like this. Committers usually form small, closely coordinated communities of senior contributors to a project.)

Increasingly, machines are generating more and more data, something that will only intensify as the Internet of Things accelerates. (This is discussed in more detail in Chapter 8.) Founded in 2003 (years before Hadoop existed), Splunk has carved out an interesting niche for itself. According to its website, the company “indexes and makes searchable data from any app, server, or network device in real time including logs, config files, messages, alerts, scripts, and metrics.”14 These files can grow to sizes simply unmanageable by many mainframes, often forcing organizations to store a limited amount of data and archive the rest.

Splunk’s clients include Groupon, Zynga, Bank of America, Akamai, and Salesforce. As of this writing, it employs nearly 500 people and has even been issued a U.S. patent. To its credit, Splunk management quickly realized the power of Hadoop and soon pivoted, introducing a number of powerful Hadoop-related offerings. Perhaps Splunk’s most interesting offering is Hadoop Connect, a user-friendly product that “helps integrate and move data easily between Splunk Enterprise and Hadoop. Conversely, data already in Hadoop can be sent to Splunk for analysis without users having to write code.”15 In other words, Splunk recognizes the power of getting data into and out of Hadoop. Given the size and variety of potential uses of Big Data, flexibility is king.

Emerging Hadoop-Based Start-Ups

Later in this chapter, we’ll explore the differences between traditional RDBMSs and columnar alternatives. For now, suffice it to say that different databases are probably best at handling very different types of data. But what if you could handle all types of data in a single, hybrid “system” or database? That’s the thinking behind Hadapt, another promising Hadoop offshoot. And the idea has legs, as evidenced by the fact that it has already received nearly $10 million in venture capital (VC) funding.16 From the company’s website, its patent-pending technology features a hybrid architecture that brings

the latest advances in relational database research to the Apache Hadoop platform. RDBMS technology has advanced significantly over the past several decades, but current analytic solutions were designed prior to the advent of Hadoop and the paradigm shift from appliance-based computing to distributed computing on clusters of inexpensive commodity hardware.17

Hadoop serves as Hadapt’s foundation. It offers one-stop shopping—an all-in-one system for structured, unstructured, and multistructured data.

But don’t think for a minute that Hadapt is the only Hadoop-based start-up or project. Far from it. As we speak, start-ups are working on hybrid tools that handle structured and unstructured data in a single system. RainStor seeks to turn an organization’s historical (and “frozen”) Small Data into Big Data. Its product “uses sophisticated data compression and de-duplication techniques to reduce the storage footprint by 95%+ less. Data retained in RainStor can be queried and analyzed directly using SQL, your favorite BI tool, or MapReduce on Hadoop without restoring or re-inflating the data.”18 RainStor stores data in partitions (i.e., large blocks that organizations can easily manage using standard file systems, HDFS, and low-cost storage platforms). The result is a low overall total cost of ownership.

As with Hadapt, VCs believe that RainStor is on to something. The company has raised $12 million.19 And it continues. Backed with $20 million from Battery Ventures, Andreessen Horowitz, and Sutter Hill Ventures, Platfora aims to make Hadoop mainstream “with an intuitive user interface that has advanced data science functions built in, rather than making users perform queries.”20 As we speak, many folks are building analytic apps that sit on top of Hadoop.

Existing Enterprise Vendors

No one would ever call Oracle Corporation a new company or a start-up with its storied history and a market capitalization around $150 billion.21 Led by bombastic CEO Larry Ellison, Oracle is famous for acquiring companies at a frenetic pace. Given the popularity of Big Data, it should be no surprise that Oracle sells a number of proprietary, closed-source products that help organizations store and interpret vast amounts of unstructured data.22 More shocking to some, though, is the fact that Oracle has developed products that work with Hadoop.23 And Oracle isn’t the only large enterprise software vendor to integrate open-source solutions into its product lines. IBM years ago recognized the power of Linux and bet big on it.24 After years of pooh-poohing open source solutions, Microsoft got a little bit pregnant with shared source.25 Many traditional software vendors are recognizing the power of open-source Big Data tools—and Hadoop in particular.

And Oracle is not alone here in jumping on the Hadoop train. IBM sells Infosphere BigInsights, an analytics platform that lives on top of Hadoop. SAP launched its HANA platform that tightly integrates with Hadoop.26 In late October 2012, as expected, Microsoft announced that it had launched a fully Windows-compatible Apache Hadoop distribution.27 The HDInsight Server is designed to work with (but does not include) Windows Server and Microsoft SQL Server. For its part, EMC claims to offer a Hadoop-friendly, “unified, and high-profit-margin, Big Data system.”28

Bottom line: Large software vendors aren’t standing still; they are reacting to the Big Data trend. Many of their clients are expressing strong interest in Big Data. Making Big Data happen requires the ability to store, retrieve, and analyze vast amounts of information. To the extent that organizations prefer integration over data silos, more familiar business intelligence (BI) and data warehousing tools have to play nice with new Big Data solutions like Hadoop.

Limitations of Hadoop

While enormously powerful, Hadoop is anything but perfect. First, as of this writing, it does not provide real-time information—although it can get pretty close thanks to supplemental components HBase and Impala. Second, programming in Hadoop is not for the faint of heart. What’s more, data consolidation may pose its own set of problems. “Aggregating data into one environment . . . increases the risk of data theft and accidental disclosure,” says Richard Clayton, a software engineer with Berico Technologies, an IT services contractor for federal agencies.29 In other words, data silos may in fact be more secure than data marts, although I’d vehemently argue that the cons of these data islands far exceed their pros. Why “fix” one problem by failing to address another more serious one? Finally, as of this writing, as mentioned at the beginning of this section, Hadoop lacks formal industry standards. James Kobielus of IBM writes, “The Hadoop market won’t fully mature and may face increasing obstacles to growth and adoption if the industry does not begin soon to converge on a truly standardized core stack.”30

Because of this and some technical considerations, plenty of people believe that Hadoop’s days are numbered.31 Based upon my research for this book, however, I strongly disagree. More likely, Hadoop will evolve and improve over time—and process an increasing share of the world’s data. Many skilled people and organizations are working on reducing the technical and knowledge gaps that exist today.

OTHER DATA STORAGE SOLUTIONS

Chapters 1 and 2 showed that, as a general rule, Big Data just doesn’t play nice with native relational databases and SQL. All of this data has to be stored somewhere and, if stalwarts like Oracle and SQL Server aren’t suitable, what’s an enterprise curious about Big Data to do?

NoSQL Databases

The past few years have seen the rise of the NoSQL “database.”59 Before continuing, it’s essential to make three points. First, I put the word “database” in quotes for a specific reason. A NoSQL database is only a database in a very general sense. When most people think of proper databases, they conjure up images of the relational kinds rife with tables described in Chapter 1. (See Tables 1.1 and 1.2 and Figure 1.2.) Perhaps more accurately, one should think of these NoSQL databases as “data stores.” Second, the term NoSQL connotes a binary, as in SQL databases rely upon SQL, while NoSQL databases do not. In reality, though, NoSQL databases use “not only SQL.” That is, they don’t rely exclusively upon SQL. Finally, many organizations concurrently use both SQL and NoSQL databases for different purposes. Using one in no way obviates or precludes using the other.

The NoSQL movement began in 2009 and took off quickly. Reasons include its general utility, the limitations of RDBMSs, and the price of many NoSQL solutions (read: free). As of this writing, there are already more than 120 projects listed on the site nosql-database.org. At a high level, NoSQL databases generally break down into four main types, as presented in Table 4.1.

Table 4.1 The Four General Types of NoSQL Databases

Type Description Examples
Key-Value Stores: The main idea here is using a hash table where there is a unique key and a pointer to a particular item of data. The key/value model is the simplest and easiest to implement, but it is inefficient when you are only interested in querying or updating part of a value, among other disadvantages. Examples: Tokyo Cabinet/Tyrant, Redis, Voldemort, Oracle BDB.
Column Family Stores: These were created to store and process large amounts of data distributed over many machines. There are still keys, but they point to multiple columns. The columns are arranged by column family. Examples: Cassandra, HBase, Riak.
Document Databases: These were inspired by Lotus Notes and are similar to key-value stores. The model is basically versioned documents that are collections of other key-value collections. The semi-structured documents are stored in formats like JSON. Document databases are essentially the next level of key/value, allowing nested values associated with each key, and they support querying more efficiently. Examples: CouchDB, MongoDB.
Graph Databases: Instead of tables of rows and columns and the rigid structure of SQL, a flexible graph model is used which, again, can scale across multiple machines. NoSQL databases do not provide a high-level declarative query language like SQL, to avoid overhead in processing. Rather, querying these databases is data-model specific. Many of the NoSQL platforms allow for RESTful interfaces to the data, while others offer query application programming interfaces (APIs). Examples: Neo4J, InfoGrid, Infinite Graph.
Adapted from “Picking the Right NoSQL Database Tool” by Mikayel Vardanyan32

As the examples in Table 4.1 illustrate, there are many types of NoSQL databases. Rather than just thinking about a single NoSQL database, it’s better to think of NoSQL as an entirely new category of databases. In point of fact, this isn’t uncommon. In the open-source world, there may be several current or sanctioned or mainstream versions of a particular application, but typically different alternatives abound. NoSQL is no exception to this rule.
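To make two of the rows in Table 4.1 a bit more tangible, here is a small, hedged sketch that stores the same customer record in a key-value store (Redis) and a document database (MongoDB). It assumes local servers and the redis and pymongo client libraries; all names are illustrative only.

```python
# Hedged sketch: the same customer record in two of the NoSQL styles from
# Table 4.1. Assumes local Redis and MongoDB servers plus the redis and
# pymongo client libraries; database and key names are illustrative.
import json
import redis
from pymongo import MongoClient

customer = {"id": "C1001", "name": "Acme Corp", "zip": "10001", "ltv": 48200}

# Key-value store: one opaque value per key. Great for lookups by key,
# awkward when you only want to query or update part of the value.
kv = redis.Redis(host="localhost", port=6379)
kv.set("customer:C1001", json.dumps(customer))
print(json.loads(kv.get("customer:C1001"))["name"])

# Document database: the nested structure is visible to the database,
# so you can query on fields inside the document itself.
mongo = MongoClient("localhost", 27017)
collection = mongo.crm.customers
collection.insert_one(dict(customer))
print(collection.find_one({"zip": "10001"})["ltv"])
```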

Like Hadoop, NoSQL databases suffer from a fair number of drawbacks. For one, by themselves, they offer few facilities for ad hoc queries and analysis. Even a simple query requires significant programming expertise, and commonly used BI tools do not provide connectivity to NoSQL.

Some relief is provided by the emergence of new Hadoop extensions, such as Hive and Pig. These projects can provide easier access to data held in Hadoop clusters and, perhaps eventually, other NoSQL databases. Quest Software has developed Toad for Cloud Databases, a product that provides ad hoc query capabilities to a variety of NoSQL databases.33

Table 4.1 lists the different types of NoSQL databases and, to be sure, some are more popular than others. However, one warrants a longer mention here. Launched in 2008, Apache Cassandra34 is a relatively mature, fault-tolerant, high-performance, and extremely scalable database used by companies such as Netflix, Twitter, Constant Contact, Reddit, Cisco, and scores of others. These companies share one common characteristic: they rely upon enormous amounts of data to power their business. Cassandra integrates with Hadoop and supports MapReduce.
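To give a flavor of how applications typically talk to Cassandra, consider the hypothetical sketch below, which follows the documented usage of the DataStax Python driver rather than any deployment of my own. The contact point, keyspace, and table are assumptions and would need to exist already.

```python
# Illustrative sketch only: writing and reading time-series sensor data with
# the DataStax cassandra-driver. The contact point, keyspace ("telemetry"),
# and table ("sensor_readings") are assumptions and must already exist.
from datetime import datetime
from cassandra.cluster import Cluster

cluster = Cluster(["127.0.0.1"])          # assumed contact point
session = cluster.connect("telemetry")    # assumed keyspace

session.execute(
    "INSERT INTO sensor_readings (sensor_id, reading_time, value) "
    "VALUES (%s, %s, %s)",
    ("sensor-7", datetime(2013, 2, 1, 9, 30), 72.4),
)

rows = session.execute(
    "SELECT reading_time, value FROM sensor_readings WHERE sensor_id = %s",
    ("sensor-7",),
)
for row in rows:
    print(row.reading_time, row.value)

cluster.shutdown()
```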

I have not used Cassandra and can’t say with certainty that it’s fundamentally “better” in any particular way than other NoSQL databases—not to mention those not included in Table 4.1. When looking at NoSQL solutions, cost and quality are but two considerations. For any open- or closed-source solution, it’s important for organizations to ensure that an adequate support network exists, even if external hires are planned. Remember that, like all Big Data tools, NoSQL is relatively new. Existing expertise is lacking. In the case of Cassandra, because of the project’s high level of demand, a vibrant third-party network has sprouted.35 Make sure that you’re not going down a largely unsupported road.

NewSQL

Think of RDBMSs and SQL as old and reliable cars. Untold numbers of people and organizations have used both extensively for decades, including yours truly. (Five years ago, I made nearly all my income off consulting gigs involving RDBMSs, SQL, and general reporting.) Those automobiles worked fine then, and they still run today. However, they were manufactured in an entirely different era. Their vintage feel aside, those cars cannot possibly take advantage of huge improvements in fuel efficiency, advances in engineering and manufacturing, and so on. And let’s not forget to mention bells and whistles like iPod connectivity (a must-have for me), GPS, Bluetooth, OnStar, and many others. It should be no surprise, then, that cars today can do things not possible three decades ago. Today, from a technology perspective, even $25,000 gets you quite a bit of car. Should it be any surprise that RDBMSs and SQL are evolving as well?

Companies like VoltDB are working on an entirely new generation of RDBMSs based upon a totally different architecture—SQL 2.0, or NewSQL, if you like.60 From the VoltDB company website, “Traditional RDBMSs . . . are based on a one-size-fits-all model we refer to as OldSQL. This model is challenged by the exponential transaction growth that led to the evolution of non-relational data stores,61 collectively referred to as NoSQL. A new generation of RDBMSs, known as NewSQL, take[s] a radically different approach that combines the speed and scale of NoSQL with the proven capabilities of OldSQL.”36

What if a traditional RDBMS could combine all that’s good about OldSQL, but without the baggage? At least that’s the promise of NewSQL. Before jumping in with both feet, though, understand that NewSQL is still playing out, and it’s not nearly as mature as NoSQL and columnar databases—at least not yet. Still, NewSQL is worth keeping an eye on. Too many brilliant people like “database high priest” Mike Stonebraker are working on NewSQL62 to ignore it altogether.

Columnar Databases

Way back in 1996, enterprise software company Sybase recognized the long-term limitations of the relational data model. In that year, Sybase launched Sybase IQ, the first columnar database. Today, it has plenty of company. Vertica (acquired by HP in 2011), newcomers like Infobright and ParAccel, and vendors like Teradata and Oracle have all either developed or acquired column-oriented databases.

Some people have a difficult time understanding the need for columnar databases, primarily because they are so accustomed to thinking about data in terms of rows and relational tables. Consider the following example. Let’s say that a customer table contains 25 fields and 100,000 records. (See Table 1.1 for a simplified version of such a table.) A business needs to determine which products sold on what dates and where (by zip code). In a row-based table, one can certainly query and count all zip codes, product stock-keeping units (SKUs), and sale dates. That’s easy enough to do, and I’ve written many thousands of queries like that in my consulting career. However, behind the scenes, that SQL statement needs to look at each field in each database record. Even though only three fields of that record concern us, the other 22 have to come along for the ride. In other words, the query needs to look at customer name, address, account number, and the like 100,000 times despite the fact that we really don’t care about that data.63
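If that explanation seems abstract, the little Python sketch below makes the same point with toy numbers: it lays out the same table row-wise and column-wise and counts how many values the three-field query must touch in each case.

```python
# Toy illustration: the same 25-field, 10,000-record customer table stored
# row-wise and column-wise, and how much data a three-field query touches.
NUM_FIELDS, NUM_RECORDS = 25, 10_000
fields = [f"field_{i}" for i in range(NUM_FIELDS)]

# Row store: every record keeps all 25 values together.
row_store = [[(r, f) for f in fields] for r in range(NUM_RECORDS)]

# Column store: each column's values are stored together, separately.
column_store = {f: [(r, f) for r in range(NUM_RECORDS)] for f in fields}

query_columns = ["field_3", "field_7", "field_12"]  # e.g., zip code, SKU, sale date

# Row-oriented scan: all 25 fields of every record pass through the I/O path,
# even though only 3 of them matter to this query.
values_read_row = sum(len(record) for record in row_store)

# Column-oriented scan: only the three requested columns are read.
values_read_col = sum(len(column_store[c]) for c in query_columns)

print(values_read_row)  # 250000 values touched
print(values_read_col)  # 30000 values touched, roughly 3/25ths of the work
```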

Now, truth be told, on a relatively tiny table like this, such a technical limitation has a negligible impact on performance, especially with this type of structured, transactional data. I would never recommend to an organization with data of this size that it buy and implement a columnar database (and transform its data in the process) just to shave one second off of a five-second query run weekly. There’s just no point. Traditional RDBMSs and SQL work just fine with Small Data, especially the structured type.

But forget tables with 100,000 records with 5 or 10 or 20 fields. I’ll see you and raise you. An SQL statement may work reasonably well on large tables with 50 or even 200 million records of this type of transactional data, but not with tables of 1 billion or more. What happens when you have to analyze terabytes or even petabytes of unstructured data? Bottom line: here traditional tools just don’t cut it.

Perhaps the single greatest limitation of row-oriented databases is speed. No one is going to wait 24 hours as a traditional SQL statement examines every field in every row, especially when time is of the essence. As an alternative, organizations are increasingly using faster-performing columnar databases. For a more technical explanation of why columnar databases offer superior performance relative to RDBMSs, consider the words of William McKnight, a longtime data-warehousing expert:

Columnar storage reduces input/output (I/O), which has gradually become the unquestioned bottleneck in analytic systems today. As you will see, columnar databases not only greatly limit the volume of storage going through the I/O channels, but also help ensure that whatever does go through I/O is very useful to building the query results. The I/O has become the bottleneck over the years due to increasing data sizes and the overwhelming need to consume that data. All data that is part of the I/O consumes resources that other I/Os cannot consume. If that data is page metadata or columns clearly uninteresting to the query, it is still consuming I/O resource. Hence, the I/Os bottleneck.37

Let’s extend McKnight’s comments a bit further. When considering whether to purchase and deploy a columnar database, the overriding question is not whether standard SQL statements and RDBMSs can theoretically handle some types of Big Data. At a high level, the answer is a highly qualified yes. I’ll grant that some Small Data tools can technically handle relatively small amounts of Big Data. Even if an organization can live with that suboptimal performance, there’s another factor to consider: cost. Columnar databases offer far superior data compression compared to their row-based counterparts—often seven to eight times better. Greater compression means lower data storage costs. While these costs have certainly plummeted, they remain a significant expense for many IT departments.
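A quick toy example shows why columns compress so well: values within a single column (think state codes or product categories) repeat far more often than values across a row, so even a simple scheme like run-length encoding collapses them dramatically. The numbers below are made up purely for illustration.

```python
# Made-up numbers, real point: a single column's values repeat so often that
# simple run-length encoding collapses 100,000 stored values into three runs.
# (Real column stores typically sort or dictionary-encode columns first.)
from itertools import groupby

state_column = ["NY"] * 40_000 + ["NJ"] * 35_000 + ["CT"] * 25_000

encoded = [(value, len(list(group))) for value, group in groupby(state_column)]
print(encoded)  # [('NY', 40000), ('NJ', 35000), ('CT', 25000)]
print(len(state_column), "values stored as", len(encoded), "runs")
```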

Before concluding this section, it’s important to note that organizations may not have to decide between row- and column-orientation. (Yes, databases can use either one, although most today are “long,” not “wide.”) Sure, organizations may find it necessary to store and manage very different types of data in very different types of databases. That doesn’t change the fact that it’s messy. What if organizations could manage both Big Data and Small Data in the same place?

I’m far from the only one asking that question. As of late, the distinction between columns and rows has begun to blur. For instance, version 14 of Teradata’s database “supports both row-store and column-oriented approaches. EMC Greenplum, Microsoft (via the most recent incarnation of SQL Server), and Aster Data (now owned by Teradata) have also recently blended row-store and column-store capabilities.”38 Today, organizations and IT departments have greater choice than ever as more and more Big Data solutions and services emerge.

Google: Following the Amazon Model?

Some people don’t realize the size of Amazon’s other (read: non-book) lines of business. Excluding books, Amazon sells 160 million products on its website as of mid-2012. To power so much traffic, commerce, and, above all, data, the company has built immense data centers and bet heavily on cloud computing. For nearly two decades now, Jeff Bezos has run a future-oriented company, often to the chagrin of profit-hungry investors. He has always invested in the long term and understands all too well the power of scale. (It’s interesting to note that Amazon offers its own Big Data play. Redshift is a fast and powerful, fully managed, petabyte-scale data warehouse service in the cloud.)

Amazon doesn’t need to use all the compute power generated by its servers and data centers. Rather than letting it fly into the ether and go to waste, Amazon decided to sell that excess to businesses of all types. In early 2006, the company launched Amazon Web Services (AWS) and, after a few years of experimentation and pricing refinements, it’s been nothing less than a blockbuster. While Amazon won’t break out revenue and profits from its different lines of business, Wall Street analysts believe that, in 2011, the company made nearly $750 million in essentially pure profit from AWS. In 2014, that number could be as high as $2.5 billion.39

At Google, we may well be seeing a parallel, Amazon-like pattern playing itself out with respect to Big Data. Google has developed a number of powerful Big Data tools that help it store, access, interpret, and retrieve ungodly amounts of data. Like Amazon seven years ago, Google has realized that some of its own internal information needs may match those of many companies today. (Hmmm, maybe there’s a business here?) To that end, Google has made some of its internal tools available for third parties to use for free or license. This section describes some of these tools. (In point of fact, we already saw early in this chapter how Hadoop can trace much of its origins to Google tools: MapReduce and the Google file system come to mind.)

At present, some of the bullets in the Google Big Data chamber are presented in Table 4.2.

Table 4.2 Google Big Data Tools

Tool Description
BigQuery Enables users to “run SQL-like queries against very large datasets, with potentially billions of rows. This can be your own data, or data that someone else has shared for you. BigQuery works best for interactive analysis of very large datasets, typically using a small number of very large, append-only tables.”40
BigTable Google’s seminal NoSQL database, BigTable is designed to scale into the petabyte range across “hundreds or thousands of machines, and to make it easy to add more machines [to] the system and automatically start taking advantage of those resources without any reconfiguration.”41
Dremel A “scalable, interactive ad-hoc query system for analysis of read-only nested data. By combining multi-level execution trees and columnar data layout, it is capable of running aggregation queries over trillion-row tables in seconds. The system scales to thousands of CPUs and petabytes of data, and has thousands of users at Google.”42 In English, this means that Google seems to have figured out a way to make Big Data look more like Small Data, especially with respect to querying times.43
MapReduce As discussed in the “Hadoop” section earlier in this chapter, MapReduce served as the initial foundation for Hadoop.
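To give a sense of what BigQuery’s “SQL-like queries” look like in practice, here is a hedged sketch using Google’s google-cloud-bigquery Python client, a newer client library than existed when this chapter was written. It assumes you have Google Cloud credentials configured and runs against one of Google’s public sample datasets.

```python
# Hedged sketch of BigQuery's "SQL-like queries against very large datasets,"
# using the google-cloud-bigquery client. Requires Google Cloud credentials;
# the table queried here is one of Google's public sample datasets.
from google.cloud import bigquery

client = bigquery.Client()  # picks up project and credentials from the environment

query = """
    SELECT word, SUM(word_count) AS appearances
    FROM `bigquery-public-data.samples.shakespeare`
    GROUP BY word
    ORDER BY appearances DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.word, row.appearances)
```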

WEBSITES, START-UPS, AND WEB SERVICES

More mature and established tools do the lion’s share of the Big Data work in many organizations. However, we are living in an era of open-source software, cloud computing, open APIs, and historically low start-up costs. As such, a great deal of exciting innovation is occurring in the Big Data world. As we’ll see in the following section, some interesting start-ups are taking Big Data to the masses, trying to solve important business and societal problems in the process.

This section presents a few of the many companies pushing the Big Data envelope. Note that this is in no way a comprehensive list; it merely represents some of the highlights of my research in writing this book.

Kaggle

What if you created a company that was equal parts funding platform (like Kickstarter and IndieGoGo), crowdsourcing company (like Innocentive), social network, wiki, and job board (like Monster or Dice)? And what if you added a dash of gamification64 to this little stew of a company? You’d wind up with Kaggle:

an innovative solution for statistical/analytics outsourcing. [It is] the leading platform for predictive modeling competitions. Companies, governments and researchers present datasets and problems—the world’s best data scientists then compete to produce the best solutions. At the end of a competition, the competition host pays prize money in exchange for the intellectual property behind the winning model.

The motivation behind Kaggle is simple: most organizations don’t have access to the advanced machine learning and statistical techniques that would allow them to extract maximum value from their data. Meanwhile, data scientists crave real-world data to develop and refine their techniques. Kaggle corrects this mismatch by offering companies a cost-effective way to harness the “cognitive surplus” of the world’s best data scientists.44

Founded in April 2010 by Anthony Goldbloom and Jeremy Howard, Kaggle seeks to make data science a sport. It is a mesmerizing hybrid of a company, and I could write an entire book about it. Anyone can post a project by selecting an industry, type (public or private), participatory level (team or individual), reward amount, and timetable. A look at some existing Kaggle competitions45 shows an amazing variety of current contests, including the following:

  • Medicine: Merck Molecular Activity Challenge: To help develop safe and effective medicines by predicting molecular activity.
  • Politics: Follow the Money: Investigative Reporting Prospect: To find hidden patterns, connections, and ultimately compelling stories in a treasure trove of data about U.S. federal campaign contributions.
  • Financial: Benchmark Bond Trade Price Challenge: To develop models to accurately predict the trade price of a bond.
  • Science: Mapping Dark Matter (Supported by NASA and the Royal Astronomical Society): A cosmological image analysis competition to measure the small distortion in galaxy images caused by dark matter.

The San Francisco–based company attempts to match those with rare skills and interests (data scientists) with people and organizations with real problems to solve—and the data that, in the right hands, could yield a solution. Sound niche? Well, it’s a pretty big niche. Kaggle sports a community of more than 40,000 predictive modeling and machine-learning experts from 100 countries and 200 universities.

Before starting a project, you need to have a goal in mind (see previous list). But what if you just own a bunch of data and aren’t sure about what to do with it? You suspect that there’s some value in it, but what do you know? You’re not a data scientist. Kaggle’s got you covered with Prospect, a way of asking its community what to do with your trove of information. Users can suggest uses for owners of large datasets.

Other Start-Ups

Organizations looking to date before they get married to Big Data (see Table 4.3) might want to consider 1010data. The company specializes in cloud-based analytics and claims that it can quickly analyze trillions of records. For their part, the folks at Precog believe that Big Data need not mean Big Complexity. Company CEO John De Goes is on record saying that “Hadoop is stupid.” Not surprisingly, the company has bucked the trend of building on top of Hadoop. Instead, Precog chose to create its own proprietary Big Data solution.46 The goal: to make Big Data less complex than current alternatives.

Table 4.3 Is Big Data Worth It? Hardware Considerations

Yes It’s full steam ahead. You’ve run the numbers, and the business case for Big Data is a no-brainer. You certainly can make the requisite investments in necessary hardware and software, but don’t rule out the cloud. Just because you can do everything in-house doesn’t mean that you should.
No One of two things has happened. Either: 1) this book hasn’t sold you on Big Data; or 2) you buy into the benefits of Big Data but your organization is just not ready to go down this road—and you know it. Fear not. It’s much better to realize this now, not after massive IT expenditures, expensive consultants, and project failures. Reevaluate in six months. Wait until your organization matures, the expected benefits increase, or the costs decline.*
Maybe/Not Sure Date before you get married. Forget about massive hardware upgrades and purchases for the time being, unless you need to make them for other essential business purposes. Try using lower-cost, secure, cloud-based services to see if the juice is worth the squeeze. Understand that, in the long term, it may well be cheaper to move to an on-premise solution.
* Note that Big Data need not be deployed only at an organizational level. Departments, teams, groups, and divisions may still benefit a great deal from Big Data solutions even if the organization isn’t ready to jump in.

I could list a dozen more start-ups here, each of which promises a slightly different take on Big Data, analytics, and other topics covered in this chapter. While I’m not completely oblivious to the VC world, I’ll let a true expert have the final say on the matter in this section.

A great deal of activity, investment, and innovation is taking place in the Big Data space. As with any emerging trend, it is sometimes difficult to separate fact from fiction. To that end, I reached out to Brad Feld, an early stage investor and entrepreneur for more than twenty years.


BIG DATA: A VENTURE CAPITALIST’S PERSPECTIVE
Like many technology terms, “Big Data” has progressed on the typical curve from a clever description of a set of technologies to a buzzword that is overused and now refers to anything and everything. It also has suddenly and magically appeared as something new and mystical, replacing historic (and perfectly descriptive, but worn-out) terms like analytics, which replaced terms from the 1970s like executive information systems and data warehouses.
Twenty years from now, the thing we call Big Data will be tiny data. It’ll be microscopic data. The volume that we’re talking about today, in 20 years, is a speck.
When everything is suddenly Big Data, it gives us an opportunity to redefine what it means and what matters about the term. And that’s where we find ourselves today, as Big Data has merely become a label describing a phenomenon. Since everything is Big Data, we can focus on what Big Data actually means, and what the implications of it are.
I encourage entrepreneurs to go one level deeper than the surface definition and talk about what they are doing in the context of an exponentially increasing amount of data. There are a lot of entrepreneurs who assert that they’re doing magical new things with data, that we can’t tell you about because it’s so incredible, and just trust this black box and give us some money—and we’ll give you amazing things out the other end. This is garbage, and the more hand waving there is, the more nonsensical the entrepreneur.
Instead, assume that there is a broad phenomenon of exponentially increasing data and that it will continue for a long time. Assume that the connections between machines, applications, and networks are progressing faster than many of us are really aware of or can comprehend. Assume that the machines have already taken over and are just waiting patiently for us. They have no incentive to exterminate us—humans killing humans is a human construct, not a machine construct, so assume the machines are going to treat us as pets.
Sounds silly, right? Yet today’s smartphones would have been called supercomputers 30 years ago. And what today’s smartphone can do pales in comparison to what it’ll be able to do in a couple of years, if it’s even called a smartphone anymore. Why do we even need a smartphone? Shouldn’t we be able to assume that we’ll have implants that are processing all the data we want in real time?
As the amount of data generated and consumed continues to expand geometrically, across many different vectors, the software that processes this data will continue to have to evolve. Whether this is at the data level, the system level, or the application level, there will regularly be new technological approaches to handling hundreds or thousands of times more data in the same time frame.
This phenomenon simply won’t stop in the foreseeable future. Imagine every cell in the body being instrumented and generating data about itself. What if every physical thing on the planet is able to act as a transmitter about a variety of data about it? And every other physical thing has the ability to process the inputs from these transmitters in real time. Now we are talking about Big Data.
As an investor, I’m interested in what people are doing at all three levels. I care a great deal about instrumenting things at the data level, such as the human being. My investment in Fitbit47 is an example of this—I believe that Fitbit is version 0.1 of our ability to fully instrument ourselves as humans. I care about this at system level—investments in companies like Mobiplug48 connect existing sensors together into one integrated whole. And I care about this at the application level, with investments in companies like Gnip49 connecting together all of the generators of real time data with the consumers of real time data.
We are at the very beginning of a Cambrian explosion of data. While it’s big right now, it’ll be ginormous before we know it. And then humongous after that. Don’t get tangled up in the buzzword. Focus on being involved in the phenomenon.
Brad Feld is currently the Managing Director at Foundry Group. He has penned several books, including the award-winning Startup Communities: Building an Entrepreneurial Ecosystem in Your City.

HARDWARE CONSIDERATIONS

At this point, it should be obvious that traditional, row-store databases just can’t handle Big Data—at least not in any meaningful way. As any IT professional knows, however, software is just part of the equation.

For a while now, organizations have been able to effectively obviate the need for purchasing and configuring expensive hardware by going to “the cloud.” In reality, though, many organizations will continue to host many of their own applications. Put differently, the cloud represents just another option for CIOs; it is hardly the only one. Each organization has to look at cost, security, and control issues when deciding whether or not to abandon the on-premise software model. This is as true with ERP and CRM applications as it is with Big Data.

Are hardware considerations being neglected in all the current Big Data hubbub? It’s a fair question to ask. To this end, ZDNet blogger Larry Dignan recently interviewed Univa CEO Gary Tyreman about the impact of Big Data on existing hardware and infrastructure. Univa bills itself as a high-performance computing (HPC) company that “provides the evolution of Grid Engine, the most widely deployed, distributed resource management software platform used by enterprises and research organizations across the globe.”50 The company counts NASA and Motorola Mobility among its clients. The following are excerpts from that interview:

Are hardware issues overlooked in all the Big Data talk?

I don’t know if they are forgetting or just not appreciating the challenges. Hadoop nodes today are 10 or less so it’s not hard to get it working. Companies are underestimating how much it takes to roll into production and get it running. In a nutshell, there’s a jump from a Hadoop pilot to actually scaling it.

What’s the solution?

Clusters today are one way to get Big Data environments set. The time has to be put in to configure the software behind the infrastructure, set storage, and fix network settings. If those configurations take two days it’s not a big deal, but then it is rolled into production and there are more complications.

Why isn’t hardware a consideration?

At this juncture, companies are primarily focused on the outcome of Big Data and what can be done. Enterprises need to focus on the outcome as well as what they want to know. Existing business intelligence tools also have to be considered.51

Let’s say that an organization’s existing hardware can support new applications like Hadoop. But a new installation is missing something big: the truly enormous amounts of data that Hadoop will be asked to store and interpret. Tyreman’s answer to the first question is indicative of a much larger IT problem: organizations tend to underestimate requirements across the board, and Big Data is no exception to this rule. At the same time, though, no chief information officer (CIO) wants to spend millions of dollars on hardware purchases and upgrades only to discover that the money could have been much better spent elsewhere. What to do?

Recognize that current Big Data tools continue to evolve and new applications pop up seemingly every day. The question isn’t whether the cost of “doing” Big Data will decline over time. The answer is clearly yes. With respect to Big Data, the intelligent organization will ask a much different and banal question: Are the expected benefits of getting on the Big Data train worth their perceived costs? To this query, there are three possible answers, represented in Table 4.3.

Brass tacks: Although Big Data does not require big hardware, organizations need to recognize that Big Data cannot just be “rolled up” or “folded into” existing Oracle, SQL Server, or DB2 databases. As a stopgap, they may want to consider Big Data appliances from vendors like Oracle52 and Teradata (Extreme Data Appliance).53 These appliances purport to load unstructured data into larger database tables and traditional data warehouses.

Bottom line: Organizations intent on using their own hardware to harness the power of Big Data will in all likelihood have to make some pretty big purchases. As Eric Savitz writes on forbes.com, “Traditional database approaches don’t scale or write data fast enough to keep up with the speed of creation. Additionally, purpose-designed data warehouses are great at handling structured data, but there’s a high cost for the hardware to scale out as volumes grow.”54 Know this going in.

THE ART AND SCIENCE OF PREDICTIVE ANALYTICS

As discussed earlier, we have never before seen data with the volume, variety, and velocity of today. Compared to ten years ago, many of today’s data sources and types may be different, but in a way nothing fundamental has changed. We’re still just trying to look at data to understand what’s going on—and why. With that information, we can derive knowledge to more confidently predict the future. Yes, the term Big Data is certainly new, but its objectives are not.

As we’ll see in the case studies in Chapter 5, new and improved data mining and predictive technologies are here—and organizations of all types are using them to do some pretty remarkable things.

Before concluding this chapter, permit me a few words on the limitations of predictive analytics. Yes, organizations can purchase and deploy best-of-breed applications such as those from vendors like SAS.55 You’ll get no argument from me on the merits of these solutions: you can only do so much with Small Data tools. However, those with unrealistic expectations are bound to be disappointed. One should not confuse reducing future uncertainty with eliminating it. Not too many predictive models saw the Arab Spring coming in 2011. Yes, an organization may well be able to get a better handle around future sales, but those estimates will still be just estimates; they will not be perfect. Too many other variables are at play. On occasion, software salespeople have been known to stretch the truth. Take the loftiest of claims with more than a grain of salt.


Think of it this way: Coupled with Big Data, predictive tools can reduce the degree of uncertainty your organization faces (both generally speaking and with respect to individual business decisions). But make no mistake: analytics aren’t crystal balls. Organizations will always face some level of uncertainty, even those that use Big Data well. As discussed in Chapter 2, some things can and always will be understood only in hindsight.

SUMMARY

This chapter provided a summary of some of the main Big Data technologies, applications, platforms, and web services. Yes, the volume, velocity, and variety of Big Data require new tools. Given that, we examined some of those specific solutions.

Because even very large amounts of structured and transactional Small Data pale in comparison to Big Data, companies require new solutions. To that end, technologies such as Hadoop, NoSQL, and columnar databases fill important needs. Also, keep an eye on emerging start-ups and new web- and cloud-based services like Kaggle. Collectively, these new tools are allowing organizations to take Big Data from theory to practice. Finally, even though highly sophisticated predictive applications can do amazing things, they are far from perfect.

Now that we know how these new technologies can be used, let’s turn to how they are actually being used. Chapter 5 presents three case studies of organizations using Big Data to achieve fascinating results.

NOTES

1. Personal conversation with Murnane, November 17, 2012.

2. “SAP Sybase IQ,” 2012, www.sybase.com/products/datawarehousing/sybaseiq, retrieved December 11, 2012.

3. “Big Data Solution Brief,” 2012, http://download.microsoft.com/download/F/A/1/FA126D6D-841B-4565-BB26-D2ADD4A28F24/Microsoft_Big_Data_Solution_Brief.pdf, retrieved December 11, 2012.

4. Harris, Derrick, “Startup Precog Says Big Data Doesn’t Need to Be So Complex,” September 27, 2012, http://gigaom.com/data/startup-precog-says-big-data-doesnt-need-to-be-so-complex/, retrieved December 11, 2012.

5. Walker, Michael, “Hadoop Technology Stack,” August 22, 2012, www.analyticbridge.com/profiles/blogs/hadoop-technology-stack, retrieved December 11, 2012.

6. Shapira, Gwen, “Hadoop and NoSQL Mythbusting,” October 4, 2011, www.pythian.com/news/27367/hadoop-and-nosql-mythbusting/, retrieved December 11, 2012.

7. Personal conversation with Kahler, November 26, 2012.

8. “Big Data,” 2012, www.talend.com/products/big-data, retrieved December 11, 2012.

9. “What Can Hadoop Do for You?,” 2012, www.hadoopworld.com/, retrieved December 11, 2012.

10. Metz, Cade, “Bread Baker Frees Software Secrets from Google Empire,” October 29, 2012, www.wired.com/wiredenterprise/2012/10/kornacker-cloudera-google, retrieved December 11, 2012.

11. Ibid.

12. “Architecting the Future of Big Data,” 2012, http://hortonworks.com/, retrieved December 11, 2012.

13. “MapR,” 2012, www.mapr.com/, retrieved December 11, 2012.

14. www.splunk.com/view/SP-CAAACVK, retrieved December 11, 2012.

15. Yasin, Rutrell, “How to Make Big Data More Useful, Reliable—and Fast,” November 5, 2012, http://gcn.com/articles/2012/11/05/splunk-big-data-useful-reliable-fast.aspx, retrieved December 11, 2012.

16. Harris, Derrick, “Hadapt Raises $9.5M for Hadoop Data Warehouse,” October 21, 2011, http://gigaom.com/cloud/hadapt-raises-9-5m-for-hadoop-data-warehouse/, retrieved December 11, 2012.

17. “Product | Hadapt,” 2012, http://hadapt.com/product/, retrieved December 11, 2012.

18. “RainStor: Cost-Effective Big Data Management,” 2012, http://rainstor.com/products/overview, retrieved December 11, 2012.

19. Harris, Derrick, “RainStor Raises $12M to Make Your Big Data Small,” October 4, 2012, http://gigaom.com/data/rainstor-raises-12m-to-turn-your-big-data-small/, retrieved December 11, 2012.

20. Harris, Derrick, “Platfora Gets $5.7M to Make Hadoop Mainstream,” September 8, 2011, http://gigaom.com/cloud/platfora-gets-5-7m-to-make-hadoop-mainstream/, retrieved December 11, 2012.

21. “Yahoo! Finance: Oracle Corporation (ORCL),” 2012, http://finance.yahoo.com/q?s=ORCL, retrieved December 11, 2012.

22. “Oracle and Big Data,” 2012, www.oracle.com/us/technologies/big-data/index.html, retrieved December 11, 2012.

23. “Oracle Loader for Hadoop,” 2012, www.oracle.com/technetwork/bdc/hadoop-loader/overview/index-1454316.html, retrieved December 11, 2012.

24. “IBM and Linux,” 2012, www-03.ibm.com/linux/, retrieved December 11, 2012.

25. “Shared Source Initiative,” 2012, www.microsoft.com/en-us/sharedsource/default.aspx, retrieved December 11, 2012.

26. “SAP Further Extends Real-Time Data Platform with ‘Big Data’ Capabilities,” May 16, 2012, www.sap.com/corporate-en/press.epx?pressid=18920, retrieved December 11, 2012.

27. Microsoft Corporation, “Big Data: Microsoft SQL Server,” November 2, 2012, www.microsoft.com/sqlserver/en/us/solutions-technologies/business-intelligence/big-data.aspx, retrieved December 11, 2012.

28. Harris, Derrick, “EMC Throws Lots of Hardware at Hadoop,” September 20, 2011, http://gigaom.com/cloud/emc-throws-lots-of-hardware-at-hadoop/, retrieved December 11, 2012.

29. Vijayan, Jaikumar, “IT Must Prepare for Hadoop Security Issues,” November 9, 2011, www.computerworld.com/s/article/9221652/IT_must_prepare_for_Hadoop_security_issues, retrieved December 11, 2012.

30. Kobielus, James, “True Hadoop Standards Are Essential for Sustaining Industry Momentum: Part 1,” October 9, 2012, http://ibmdatamag.com/2012/10/true-hadoop-standards-are-essential-for-sustaining-industry-momentum-part-1/, retrieved December 11, 2012.

31. Miller, Mike, “Why the Days Are Numbered for Hadoop as We Know It,” July 7, 2012, http://gigaom.com/cloud/why-the-days-are-numbered-for-hadoop-as-we-know-it/, retrieved December 11, 2012.

32. Vardanyan, Mikayel, “Picking the Right NoSQL Database Tool at Uptime and Performance Tips,” May 22, 2011, http://blog.monitis.com/index.php/2011/05/22/picking-the-right-nosql-database-tool/, retrieved December 11, 2012.

33. Harrison, Guy, “10 Things You Should Know About NoSQL Databases,” August 26, 2010, www.techrepublic.com/blog/10things/10-things-you-should-know-about-nosql-databases/1772, retrieved December 11, 2012.

34. “Welcome to Apache Cassandra,” November 30, 2012, http://cassandra.apache.org/, retrieved December 11, 2012.

35. Hobbs, Tyler, “Third Party Support: Cassandra Wiki,” September 6, 2012, http://wiki.apache.org/cassandra/ThirdPartySupport, retrieved December 11, 2012.

36. “VoltDB: NewSQL Benefits,” August 18, 2012, http://voltdb.com/blog, retrieved December 11, 2012.

37. McKnight, William, “Columnar Databases,” BeyeNETWORK, www.b-eye-network.com/view/15506.

38. Henschen, Doug, “Big Data: Fast, Complex, Varied, Costly,” InformationWeek, October 2011, p. 28.

39. Hickey, Andrew R., “Amazon Cloud Revenue Could Exceed $500 Million in 2010: Report,” August 3, 2010, www.crn.com/news/applications-os/226500204/amazon-cloud-revenue-could-exceed-500-million-in-2010-report.htm, retrieved December 11, 2012.

40. “Google BigQuery,” November 14, 2012, https://developers.google.com/bigquery/docs/overview, retrieved December 11, 2012.

41. O’Reilly, Tim, “Database War Stories #7: Google File System and BigTable,” May 3, 2006, http://radar.oreilly.com/archives/2006/05/database_war_stories_7_google.html, retrieved November 11, 2012.

42. Melnik, Sergey; Gubarev, Andrey; Long, Jing Jing; Romer, Geoffrey; Shivakumar, Shiva; Tolton, Matt; Vassilakis, Theo, “Dremel: Interactive Analysis of Web-Scale Datasets,” 2010, http://research.google.com/pubs/pub36632.html, retrieved December 11, 2012.

43. Metz, Cade, “Google’s Dremel Makes Big Data Look Small,” August 16, 2012, www.wired.com/wiredenterprise/2012/08/googles-dremel-makes-big-data-look-small/, retrieved December 11, 2012.

44. “Kaggle: We’re Making Data Science a Sport,” 2012, www.kaggle.com, retrieved December 11, 2012.

45. “Kaggle: Competitions,” 2012, www.kaggle.com/competitions, retrieved December 11, 2012.

46. Harris, Derrick, “Startup Precog Says Big Data Doesn’t Need to Be So Complex,” September 27, 2012, http://gigaom.com/data/startup-precog-says-big-data-doesnt-need-to-be-so-complex, retrieved December 11, 2012.

47. www.fitbit.com, retrieved December 11, 2012.

48. www.mobiplug.co, retrieved December 11, 2012.

49. www.gnip.com, retrieved December 11, 2012.

50. “Our Story,” 2012, www.univa.com/about, retrieved December 11, 2012.

51. Dignan, Larry, “Big Data Projects: Is the Hardware Infrastructure Overlooked?,” October 18, 2012, www.zdnet.com/big-data-projects-is-the-hardware-infrastructure-overlooked-7000005940/, retrieved December 11, 2012.

52. “Oracle Big Data Appliance,” 2012, www.oracle.com/us/products/database/big-data-appliance/overview/index.html, retrieved December 11, 2012.

53. “Teradata Extreme Data Appliance: Deep-Dive Analytics at an Entry-Level Price,” 2012, www.teradata.com/extreme-data-appliance/#tabbable=0&tab1=0&tab2=0&tab3=0, retrieved December 11, 2012.

54. Bantleman, John, “The Big Cost of Big Data,” April 16, 2012, www.forbes.com/sites/ciocentral/2012/04/16/the-big-cost-of-big-data/, retrieved December 11, 2012.

55. “Predictive Analytics and Data Mining,” 2012, www.sas.com/technologies/analytics/datamining/index.html, retrieved December 11, 2012.

56. This will be a high-level overview, not a how-to section.

57. Facebook used Hadoop to create strategic analytic applications involving massive volumes of user data.

58. At its highest level, a file system organizes data on a storage device for future retrieval.

59. Some people consider Hadoop a NoSQL solution, while others don’t. For the sake of organization, this section focuses on Hadoop alternatives.

60. Note that VoltDB does not “own” NewSQL. NewSQL is a very broad term.

61. A data store is a repository of integrated objects that contain data.

62. For more on NewSQL, check out http://newsql.sourceforge.net.

63. Of course, as any good DBA will tell you, it’s not hard to create separate temp tables and perform other back-end tricks, but let’s keep the example simple here.

64. The site lets users rate other users.
