Managing Big Data in the Cloud

You've probably heard the term big data. It has drawn a lot of attention recently because organizations need to process ever-increasing amounts of data. The key fact about big data is that it sits at the tipping point where the workarounds that organizations have historically put in place to manage large volumes of complex data stop being enough. Big data technologies allow people to analyze and use this data effectively.

Big data characteristics

Big data generally has three characteristics:

Volume: Big data is big in volume, and although the word big is relative here, currently we’re talking on the order of at least terabytes. Many big data implementations are looking to analyze petabytes of information.

Name        Value
Byte        10**0 bytes
Gigabyte    10**9 bytes
Terabyte    10**12 bytes
Petabyte    10**15 bytes
Exabyte     10**18 bytes

Variety: Big data comes in different shapes and sizes. It includes these types of data:

Structured data is the typical kind of data that analysts are used to dealing with. It includes revenue, number of sales — the type of data you think about including in a database. Structured data is also being produced in new ways in products such as sensors and RFID tags.

Semistructured data has some structure to it but not in the way you think about tables in a database. Examples include EDI formats and XML.

Unstructured data includes text, images, and audio, including any document, e-mail message, tweet, or blog internal to a company or on the Internet. Unstructured data is commonly estimated to account for about 80 percent of all data.

Velocity: This is the speed at which the data moves. Think about sensors capturing data every millisecond or data streams output from medical equipment. Big data often comes at you in a stream, so it has a real-time nature associated with it.
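The three kinds of data described under Variety can be illustrated with a small sketch. The sample records, field names, and helper functions below are invented for illustration; they only show why structured data drops straight into rows and columns while the other two kinds need extra work before you can analyze them.

```python
import xml.etree.ElementTree as ET

# Structured: fixed, named fields that map directly onto a table row.
structured = {"region": "West", "revenue": 1250.0, "units_sold": 40}

# Semistructured: tagged and nested (XML here), but not shaped like a single row.
semistructured = (
    "<order id='17'>"
    "<item sku='A1' qty='2'/>"
    "<item sku='B4' qty='1'/>"
    "</order>"
)

# Unstructured: free text with no declared fields at all.
unstructured = "Loved the product, but shipping took two weeks."


def total_quantity(order_xml: str) -> int:
    """Walk the XML tags to recover a number a table would store in one column."""
    root = ET.fromstring(order_xml)
    return sum(int(item.get("qty")) for item in root.findall("item"))


def word_count(text: str) -> int:
    """About the simplest measurement you can take from free text."""
    return len(text.split())


if __name__ == "__main__":
    print(structured["revenue"])           # read a field directly
    print(total_quantity(semistructured))  # parse the tags first
    print(word_count(unstructured))        # no fields; derive something from the text
```

The structured record needs no parsing, the XML needs a parser that understands its tags, and the free text offers nothing to query until you derive features from it.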

The cloud is an ideal place for big data because of its scalable storage, compute power, and elastic resources. The cloud model is large-scale, distributed computing, and a number of frameworks and technologies have emerged to support this model, including

Apache Hadoop: An open source distributed computing platform written in Java. It is a software library that enables distributed processing across clusters of computers. At its core is a distributed file system, the Hadoop Distributed File System (HDFS), which pools the storage of the computers in a cluster. Hadoop was designed to deal with problems where there’s a large amount of complex data. The data can be structured, unstructured, or semistructured. Hadoop can run across a lot of servers that don’t share memory or disk. For more information about Hadoop, visit http://hadoop.apache.org.

MapReduce: A software framework introduced by Google to support distributed computing on large sets of data. It’s at the heart of what Hadoop is doing with big data and big data analytics. It’s designed to take advantage of cloud resources. This computing is done across numerous computers, grouped into clusters. Each computer in a cluster is referred to as a node. MapReduce can deal with both structured and unstructured data. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs and a reduce function that merges all the intermediate values that share the same key.
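The map and reduce functions described above can be sketched with the classic word-count example. This is a single-process illustration of the programming model, not Hadoop's or Google's actual API; the names map_fn, reduce_fn, and run_mapreduce are hypothetical, and a real framework would spread the map and reduce calls across the nodes of a cluster.

```python
from collections import defaultdict
from typing import Iterable, Iterator


def map_fn(doc_id: str, text: str) -> Iterator[tuple[str, int]]:
    """Map: take one (key, value) pair and emit intermediate (word, 1) pairs."""
    for word in text.lower().split():
        yield word, 1


def reduce_fn(word: str, counts: Iterable[int]) -> int:
    """Reduce: merge all intermediate values that share the same key."""
    return sum(counts)


def run_mapreduce(documents: dict[str, str]) -> dict[str, int]:
    # Shuffle step: group the intermediate values by key.
    # A real framework does this across nodes; here it is a local dict.
    grouped: dict[str, list[int]] = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            grouped[key].append(value)
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    docs = {"a": "big data big", "b": "data cloud"}
    print(run_mapreduce(docs))
```

Because each map call looks at only one input pair and each reduce call looks at only one key, the calls can run independently of one another, which is what makes the model easy to distribute.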

Big data databases

One important appeal of Hadoop is that it can handle different types of data. Parallel database management systems have been on the market for decades. They can support parallel execution because most of the tables are partitioned over the nodes in a cluster, and they can translate SQL commands into a plan that is divided across the nodes in the cluster. However, they mostly deal with structured data because it’s hard to fit unstructured, freeform data into the columns and rows in a relational model.
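As a rough sketch of how a parallel database can divide one query across nodes, the following partitions a table by key, computes a partial result on each partition, and merges the partials. The three-node cluster, the column names, and the function names are assumptions made for illustration; real systems use far more sophisticated planners.

```python
from collections import defaultdict

NUM_NODES = 3  # hypothetical cluster size


def partition(rows, key, num_nodes=NUM_NODES):
    """Spread the table's rows over the nodes based on an integer key column."""
    nodes = [[] for _ in range(num_nodes)]
    for row in rows:
        nodes[row[key] % num_nodes].append(row)
    return nodes


def local_sum(node_rows):
    """The part of the plan each node runs on its own partition:
    roughly SELECT region, SUM(amount) ... GROUP BY region."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row["region"]] += row["amount"]
    return dict(totals)


def merge(partials):
    """The final step of the plan: combine the per-node partial sums."""
    combined = defaultdict(float)
    for partial in partials:
        for region, amount in partial.items():
            combined[region] += amount
    return dict(combined)


def parallel_sum_by_region(rows):
    return merge(local_sum(node_rows) for node_rows in partition(rows, "order_id"))


if __name__ == "__main__":
    orders = [
        {"order_id": 1, "region": "West", "amount": 10.0},
        {"order_id": 2, "region": "East", "amount": 5.0},
        {"order_id": 3, "region": "West", "amount": 2.5},
    ]
    print(parallel_sum_by_region(orders))
```

The sketch works only because every row has the same named columns, which is the reason this approach suits structured data better than freeform data.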

Hadoop has helped start a movement in what has been called NoSQL, meaning not only SQL. The term refers to a set of technologies that differ from relational database systems. One major difference is that they typically don’t use SQL as their main query language. They are also designed for distributed data stores. There are numerous examples of these kinds of databases, including the following:

Apache Cassandra: An open source distributed data management system originally developed by Facebook. It has no stringent structure requirements, so it can handle all different types of data. Experts claim it excels at high-volume, real-time transaction processing. Other open source databases include MongoDB, Apache CouchDB, and Apache HBase.

Amazon SimpleDB: Amazon likens this database to a spreadsheet in that it has columns and rows with attributes and items stored in each. Unlike a spreadsheet, however, each cell can have multiple values, and each item can have its own set of associated attributes. Amazon then automatically indexes the data. In 2012, Amazon announced Amazon DynamoDB as a way to bring big data NoSQL to the cloud.

Google Bigtable: This hybrid is sort of like one big table. Because tables can be large, they’re split at the row boundaries into tablets, which might be hundreds of megabytes or so. MapReduce is often used for generating and modifying data stored in Bigtable.
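The spreadsheet-like model described for Amazon's database in the list above can be sketched in memory as follows. This is a toy illustration of the data model only, not the AWS API; the Domain class and its put, get, and find methods are hypothetical names.

```python
from collections import defaultdict


class Domain:
    """Hypothetical in-memory stand-in for a spreadsheet-like NoSQL table."""

    def __init__(self):
        # item name -> attribute -> set of values (a "cell" can hold several values)
        self.items = defaultdict(lambda: defaultdict(set))
        # (attribute, value) -> item names, kept up to date on every write
        self.index = defaultdict(set)

    def put(self, item_name, attribute, value):
        self.items[item_name][attribute].add(value)
        self.index[(attribute, value)].add(item_name)

    def get(self, item_name):
        attributes = self.items.get(item_name, {})
        return {name: set(values) for name, values in attributes.items()}

    def find(self, attribute, value):
        """Look up items through the index instead of scanning every item."""
        return set(self.index.get((attribute, value), set()))


if __name__ == "__main__":
    products = Domain()
    products.put("shirt", "color", "red")
    products.put("shirt", "color", "blue")    # second value in the same cell
    products.put("shirt", "size", "M")
    products.put("mug", "color", "red")
    products.put("mug", "material", "ceramic")  # attribute the shirt doesn't have
    print(products.get("shirt"))
    print(products.find("color", "red"))
```

No schema is declared up front: each item carries whatever attributes it was given, and the index is maintained automatically as data is written.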

Remember: NoSQL doesn’t mean that people should not be using SQL. Rather, the idea is that, depending on what your problem is, relational databases and NoSQL databases can coexist in an organization.
