Chapter 12. Working with Big Data

The amount of data being generated is increasing at an exponential rate. Today's systems generate and record information on customer behavior, distributed systems, network analysis, sensors, and many other sources. While mobile data is driving much of the current growth, the next big thing, the Internet of Things (IoT), will push the rate of growth even higher.

What this means for data mining is a new way of thinking. Complex algorithms with long run times need to be improved or discarded, while simpler algorithms that can scale to many more samples are becoming more popular. As an example, while support vector machines are great classifiers, some variants are difficult to apply to very large datasets. In contrast, simpler algorithms such as logistic regression cope much more easily in these scenarios.
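As a hedged illustration of this point (not taken from the chapter's own examples), the sketch below trains a logistic-regression-style model out of core with scikit-learn's SGDClassifier and its partial_fit method, processing the data in small batches so the full dataset never has to fit in memory. The batch generator and dataset shape are hypothetical, and in older scikit-learn versions the loss is named "log" rather than "log_loss".

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    # SGDClassifier with a logistic loss behaves like logistic regression,
    # but supports incremental (out-of-core) learning via partial_fit.
    # Older scikit-learn versions use loss="log" instead of "log_loss".
    model = SGDClassifier(loss="log_loss", random_state=14)

    def batches(n_batches=100, batch_size=1000, n_features=20):
        # Hypothetical stand-in for reading successive chunks of a large dataset
        rng = np.random.RandomState(14)
        for _ in range(n_batches):
            X = rng.randn(batch_size, n_features)
            y = (X[:, 0] > 0).astype(int)
            yield X, y

    classes = np.array([0, 1])
    for X_batch, y_batch in batches():
        # Each call updates the model with one chunk; we never hold all chunks at once
        model.partial_fit(X_batch, y_batch, classes=classes)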

In this chapter, we will investigate the following:

  • Big data challenges and applications
  • The MapReduce paradigm
  • Hadoop MapReduce
  • mrjob, a Python library for running MapReduce programs on Amazon's infrastructure (a short preview sketch follows this list)
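
As a preview of what a MapReduce job looks like in mrjob, here is a minimal word-count sketch; the class name and file name are illustrative rather than ones used later in the chapter. The mapper emits a count of 1 for each word in a line, and the reducer sums those counts per word.

    from mrjob.job import MRJob

    class MRWordCount(MRJob):
        def mapper(self, _, line):
            # Called once per input line; emit (word, 1) pairs
            for word in line.split():
                yield word.lower(), 1

        def reducer(self, word, counts):
            # Called once per distinct word, with all of its counts
            yield word, sum(counts)

    if __name__ == '__main__':
        MRWordCount.run()

Saved as, say, word_count.py, this could be run locally with python word_count.py input.txt, or on Amazon Elastic MapReduce by adding the -r emr option (with AWS credentials configured).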

Big data

What makes big data different? Most big-data proponents talk about the four Vs of big data:

  1. Volume: The amount of data that we generate and store is growing at an increasing rate, and predictions generally suggest only further increases. Today's multi-gigabyte hard drives will turn into exabyte hard drives in a few years, and network throughput will increase as well. The signal-to-noise ratio can be very low, with important data being lost in a mountain of unimportant data.
  2. Velocity: While related to volume, the velocity of data is increasing too. Modern cars have hundreds of sensors that stream data into their computers, and the information from these sensors needs to be analyzed at a subsecond level to operate the car. It isn't just a case of finding answers in the volume of data; those answers often need to come quickly.
  3. Variety: Neat datasets with clearly defined columns make up only a small part of the data that we have these days. Consider a social media post, which may have text, photos, user mentions, likes, comments, videos, geographic information, and other fields. Simply ignoring the parts of this data that don't fit your model leads to a loss of information, but integrating that information can itself be very difficult.
  4. Veracity: With the increase in the amount of data, it can be hard to determine whether the data is being correctly collected—whether it is outdated, noisy, contains outliers, or generally whether it is useful at all. Being able to trust the data is hard when a human can't reliably verify it. External datasets are also increasingly being merged into internal ones, giving rise to more problems relating to the veracity of the data.

These four main Vs (others have proposed additional Vs) outline why big data is different from simply having lots of data. At these scales, the engineering problem of working with the data is often hard enough on its own, let alone the analysis. While there are plenty of snake oil salesmen who overstate what big data can do, it is hard to deny the engineering challenges and the potential of big-data analytics.

The algorithms we have used to date load the dataset into memory and then work on that in-memory version. This gives a large benefit in terms of computational speed, as it is much faster to compute on in-memory data than to load each sample from disk before we use it. In addition, in-memory data allows us to iterate over the data many times, improving our model.

With big data, we can't load our data into memory. In many ways, this is a good working definition of whether a problem is big data or not: if the data can fit in the memory of your computer, you aren't dealing with a big data problem.
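To make the contrast concrete, here is a minimal sketch (with a hypothetical file name and record format) of out-of-core processing in Python: rather than loading the whole file, we stream it one record at a time and keep only a small running summary in memory.

    from collections import Counter

    def iter_records(filename):
        # Stream the file one line at a time; only one line is in memory at once
        with open(filename) as f:
            for line in f:
                yield line.rstrip("\n").split(",")

    # Running aggregate over an arbitrarily large file, without loading it all
    counts = Counter()
    for record in iter_records("huge_dataset.csv"):  # hypothetical file
        counts[record[0]] += 1

    print(counts.most_common(5))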
