Chapter 1. Big Data Overview

Everything an organization or an individual does leaves a digital footprint that can be used or analyzed. Quite simply, big data refers to the use of an ever-increasing volume of data.

In this chapter, we will provide an overview of big data and Hadoop through the following topics:

  • What is big data?
  • Big data technology evolution
  • Big data landscape
  • Careers in big data
  • Hadoop architecture
  • The Hadoop jungle explained
  • Hadoop distributions

What is big data?

There are many definitions of big data provided by different consultancies and IT providers. Here is a shortlist of two of the best definitions.

The first and the coolest definition is "data that is so big it cannot be processed using traditional data processing tools and applications".

Second, the most professionally accepted definition is the famous 3V one—"3Vs (volume, variety, and velocity) are the three defining properties or dimensions of big data. Volume refers to the amount of data, variety refers to the number of types of data, and velocity refers to the speed of data processing."

Data volume

There is an ongoing debate about data volume: exactly how much data qualifies as big data? Unfortunately, there is no defined rule for classifying it. For a small company that deals with gigabytes of data, terabytes are considered big. For a large company that deals with terabytes of data, petabytes are considered big. In any case, we are talking about more than a terabyte of data.

The size of data is growing at an exponential rate for both individuals and organizations. Also, no one wants to throw away data, especially when the price of hard disks has been dropping consistently. In addition to a desire to store endless data, corporations also want to analyze it. The data is in different forms—structured data such as massive transactional data, semistructured data such as documents, and unstructured data such as images and videos.

Data velocity

Until around 5 years ago, organizations would extract, transform, and load (ETL) large amounts of data in daily batches into data warehouses or data marts, with business intelligence and analytics tools on top. Now, with more real-time data sources such as messages, social media, and transactions, processing data in daily batches adds much less value. Business is now much more competitive and increasingly online. Faster data generation and more demanding business needs have pushed the required data velocity to as fast as real time.

With memory and hard disks getting cheaper and machines getting faster every day, the expectation of real-time data analytics has never been greater than it is now.
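To make the contrast between daily batches and real-time processing concrete, here is a minimal Python sketch. It is an illustration only, not how any particular ETL or streaming product works: the record shape (customer, amount) and the function names batch_totals and streaming_totals are hypothetical, chosen for this example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical record shape for illustration: (customer_id, amount)
Event = Tuple[str, float]


def batch_totals(events: Iterable[Event]) -> Dict[str, float]:
    """Batch style: collect the whole day's events, then aggregate once."""
    totals: Dict[str, float] = defaultdict(float)
    for customer, amount in events:
        totals[customer] += amount
    return dict(totals)


def streaming_totals(events: Iterable[Event]) -> Iterator[Dict[str, float]]:
    """Streaming style: update the running totals and emit them after every event."""
    totals: Dict[str, float] = defaultdict(float)
    for customer, amount in events:
        totals[customer] += amount
        yield dict(totals)  # a snapshot is available immediately


events = [("alice", 10.0), ("bob", 5.0), ("alice", 2.5)]

final_batch = batch_totals(events)           # one answer, after all data has arrived
snapshots = list(streaming_totals(events))   # one answer per incoming event
```

Both approaches end with the same totals. The difference is when an answer becomes available: the batch version has nothing to report until the full batch has been processed, while the streaming version has an up-to-date result after each event.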

Data variety

Earlier, data was mostly in the form of structured databases, which were relatively easy to process using traditional data integration and analytics tools.

But now businesses want to process heterogeneous types of data: anything from Excel tables and databases to pure text, pictures, videos, web data, GPS data, sensor data, documents, mobile messages, or even papers that can be scanned and converted to an electronic format.

Organizations must adapt to the new data formats and be able to process and analyze them.
