Managing Big Data in the Cloud

Perhaps you’ve heard of the term big data? It has been getting a lot of attention because of the ongoing need to process increasing amounts of data. Big data begins at the tipping point where the workarounds that organizations have historically put in place to manage large volumes of complex data stop being enough. Big data technologies allow people to actually analyze and use this data in an effective way.

Master data management (MDM)

MDM is a technique that helps companies establish consistent and accurate definitions of data across their IT assets. It is about breaking down data silos to reach a single version of the truth. For example, you might have multiple systems in your company, each containing customer information. But what is a customer? In the pharmaceutical industry, in one system, a customer might refer to a physician. In another system, it might be a group practice. In yet another system, it might be a hospital that buys drugs in bulk. So, if you’re mapping this data together, you need to understand what you’re calling a customer or else you’ll end up with data that doesn’t make any sense when you go to analyze it. That’s what master data is all about.
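
As a minimal sketch of the mapping problem described above, the following Python snippet tags records from different systems with one agreed customer type before they are combined. The system names, record fields, and the `MASTER_CUSTOMER_TYPES` mapping are hypothetical and invented for illustration only; a real MDM effort involves far more than a lookup table.

```python
# Hypothetical mapping from each source system to what it means by "customer".
MASTER_CUSTOMER_TYPES = {
    "sales_crm": "physician",
    "contracts": "group_practice",
    "bulk_orders": "hospital",
}


def to_master_record(system, record):
    """Tag a source record with the master customer type for its system."""
    if system not in MASTER_CUSTOMER_TYPES:
        raise ValueError(f"No master definition for system: {system}")
    return {
        "source_system": system,
        "customer_type": MASTER_CUSTOMER_TYPES[system],
        "name": record["name"].strip(),
    }


# Two hypothetical records from two different systems.
records = [
    ("sales_crm", {"name": "Dr. A. Rivera "}),
    ("bulk_orders", {"name": "Lakeside General"}),
]
master = [to_master_record(system, rec) for system, rec in records]
```

With the customer type made explicit on every record, an analysis can count physicians and hospitals separately instead of mixing them under one ambiguous label.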

Big data characteristics

Big data generally has three characteristics:

Volume: Big data is big in volume, and although the word big is relative here, currently we’re talking on the order of at least terabytes. Many big data implementations are looking to analyze petabytes of information.

Unit	Size
Byte	10**0 bytes (1 byte)
Gigabyte	10**9 bytes
Terabyte	10**12 bytes
Petabyte	10**15 bytes
Exabyte	10**18 bytes

Variety: Big data comes in different shapes and sizes. It includes these types of data:

• Structured data is the typical kind of data that analysts are used to dealing with. It includes revenue, number of sales — the type of data you think about including in a database. Structured data is also being produced in new ways in products such as sensors and RFID tags.

• Semistructured data has some structure to it but not in the way you think about tables in a database. It includes EDI formats or XML.

• Unstructured data includes text, image, and audio, including any document, e-mail message, tweet, or blog internal to a company or on the Internet. Unstructured data is commonly estimated to account for about 80 percent of all data.

Velocity: This is the speed at which the data moves. Think about sensors capturing data every millisecond or data streams output from medical equipment. Big data often comes at you in a stream, so it has a real-time nature associated with it.
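
To make the link between velocity and volume concrete, here is a rough back-of-the-envelope calculation in Python. It uses the decimal units from the table above and the one-reading-per-millisecond sensor mentioned in the text. The size of 100 bytes per reading is an assumption chosen only for illustration.

```python
# Decimal (SI) storage units, matching the table above.
UNITS = {
    "gigabyte": 10**9,
    "terabyte": 10**12,
    "petabyte": 10**15,
    "exabyte": 10**18,
}

READINGS_PER_SECOND = 1_000   # one reading every millisecond
BYTES_PER_READING = 100       # assumption, for illustration only
SECONDS_PER_DAY = 24 * 60 * 60


def bytes_per_day(sensors):
    """Bytes produced in one day by the given number of sensors."""
    return sensors * READINGS_PER_SECOND * BYTES_PER_READING * SECONDS_PER_DAY


def days_to_reach(unit, sensors):
    """Days until the sensors have produced one `unit` of data."""
    return UNITS[unit] / bytes_per_day(sensors)


one_sensor = bytes_per_day(1)  # 8_640_000_000 bytes, about 8.64 gigabytes
```

Under these assumptions a single sensor needs roughly 116 days to produce a terabyte, while a thousand such sensors get there in under three hours, which is why streaming sources push implementations toward the petabyte range.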

The cloud is an ideal place for big data because of its scalable storage, compute power, and elastic resources. The cloud model is large-scale, distributed computing, and a number of frameworks and technologies have emerged to support this model, including

Apache Hadoop: An open source distributed computing platform written in Java. It is a software library that enables distributed processing across clusters of computers. At its core is a distributed file system, the Hadoop Distributed File System (HDFS), which pools the storage of the computers in a cluster, paired with a framework for processing the data stored there. Hadoop was designed to deal with problems where there’s a large amount of complex data. The data can be structured, unstructured, or semistructured. Hadoop can run across a lot of servers that don’t share memory or disk. For more information about Hadoop, visit http://hadoop.apache.org.

MapReduce: A software framework introduced by Google to support distributed computing on large sets of data. It’s at the heart of what Hadoop is doing with big data and big data analytics. It’s designed to take advantage of cloud resources. This computing is done across numerous computers, which together form a cluster. Each computer in the cluster is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all the intermediate values that share the same key.
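
The map and reduce functions described above can be sketched in plain Python with the classic word-count example. This is a single-process illustration of the programming model, not Hadoop's or Google's API. The names `map_words`, `reduce_counts`, and `run_mapreduce` are hypothetical, and a real framework would run the map and reduce calls on many nodes in parallel.

```python
from collections import defaultdict


def map_words(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for each word."""
    for word in text.lower().split():
        yield word, 1


def reduce_counts(word, counts):
    """Reduce: merge all intermediate values for one key."""
    return word, sum(counts)


def run_mapreduce(documents, map_fn, reduce_fn):
    """Run map, group by key (the shuffle step), then reduce, in one process."""
    grouped = defaultdict(list)
    for key, value in documents.items():
        for out_key, out_value in map_fn(key, value):
            grouped[out_key].append(out_value)
    return dict(reduce_fn(k, v) for k, v in grouped.items())


# Two tiny hypothetical input documents, keyed by document ID.
docs = {"d1": "big data in the cloud", "d2": "big data is big"}
word_counts = run_mapreduce(docs, map_words, reduce_counts)
```

Because each map call looks at one input pair and each reduce call looks at one key, the work can be spread across nodes without the functions needing to know about each other.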

Big data databases

One important appeal of Hadoop is that it can handle different types of data. Parallel database management systems have been on the market for decades. They can support parallel execution because most of the tables are partitioned over the nodes in a cluster, and they can translate SQL commands into a plan that is divided across the nodes in the cluster. However, they mostly deal with structured data because it’s hard to fit unstructured, freeform data into the columns and rows in a relational model.
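
As a rough illustration of how a partitioned table and a divided query plan fit together, the sketch below spreads a small table across three simulated nodes and answers a `GROUP BY` style query by combining per-node partial results. The table contents, the node count, and the modulo partitioning rule are assumptions made for this example; real parallel databases use more sophisticated partitioning and planning.

```python
from collections import Counter

NUM_NODES = 3  # hypothetical cluster size


def node_for(key):
    """Pick the node that stores a row, using a simple modulo on the key."""
    return key % NUM_NODES


# Hypothetical sales rows: (order_id, region, amount).
rows = [(1, "east", 50), (2, "west", 20), (3, "east", 30), (4, "west", 10)]

# Partition the table across the nodes by order_id.
partitions = {n: [] for n in range(NUM_NODES)}
for row in rows:
    partitions[node_for(row[0])].append(row)


def local_sum_by_region(partition):
    """Per-node step of: SELECT region, SUM(amount) ... GROUP BY region."""
    totals = Counter()
    for _, region, amount in partition:
        totals[region] += amount
    return totals


# Coordinator step: merge the partial results from every node.
result = Counter()
for partition in partitions.values():
    result.update(local_sum_by_region(partition))
```

Each node only scans its own rows, and the coordinator only adds up small partial totals, which is the basic reason such a plan can run in parallel.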

Hadoop has helped start a movement in what has been called NoSQL, meaning not only SQL. The term refers to a set of technologies that is different from relational database systems. One major difference is that they typically don’t use SQL as their primary query language. They are also designed for distributed data stores. There are numerous examples of these kinds of databases, including the following:

Apache Cassandra: An open source distributed data management system originally developed by Facebook. It has no stringent structure requirements, so it can handle all different types of data. Experts claim it excels at high-volume, real-time transaction processing. Other open source databases include MongoDB, Apache CouchDB, and Apache HBase.

Amazon SimpleDB: Amazon likens this database to a spreadsheet in that it has columns and rows with attributes and items stored in each. Unlike a spreadsheet, however, each cell can have multiple values, and each item can have its own set of associated attributes. Amazon then automatically indexes the data. Recently, Amazon announced Amazon DynamoDB as a way to bring big data NoSQL to the cloud.

Google Bigtable: This hybrid is sort of like one big table. Because tables can be large, they’re split at the row boundaries into smaller pieces called tablets, which might be hundreds of megabytes or so each. MapReduce is often used for generating and modifying data stored in Bigtable.
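
The idea of splitting one large table at row boundaries can be sketched as follows. Google's Bigtable paper calls the resulting pieces tablets, so the code uses that name. This is a toy model, not Bigtable's API: the function names and sample rows are hypothetical, and a row count stands in for the size-in-bytes threshold a real system would use.

```python
def split_into_tablets(rows, max_rows_per_tablet):
    """Split rows, sorted by row key, into contiguous tablets.

    Real systems split by size in bytes; a row count stands in for that here.
    """
    ordered = sorted(rows.items())
    return [
        ordered[i:i + max_rows_per_tablet]
        for i in range(0, len(ordered), max_rows_per_tablet)
    ]


def find_tablet(tablets, row_key):
    """Return the index of the tablet whose key range covers row_key."""
    for index, tablet in enumerate(tablets):
        if tablet[0][0] <= row_key <= tablet[-1][0]:
            return index
    return None


# Hypothetical table keyed by reversed domain name.
table = {
    "com.example/a": "page a",
    "com.example/b": "page b",
    "org.sample/x": "page x",
    "org.sample/y": "page y",
    "org.sample/z": "page z",
}
tablets = split_into_tablets(table, max_rows_per_tablet=2)
```

Because rows stay sorted by key and each piece covers a contiguous key range, a lookup only needs to find the one piece whose range contains the key, and different pieces can live on different servers.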

NoSQL doesn’t mean that people should not be using SQL. Rather, the idea is that, depending on what your problem is, relational databases and NoSQL databases can coexist in an organization.
