Chapter 9. Big Data Machine Learning – The Final Frontier

In recent years, we have seen exponential growth in the data generated by humans and machines. Sources as varied as home sensors, healthcare monitoring devices, news feeds, conversations on social media, images, and worldwide commerce transactions, to name only a few, contribute to the vast volume of data generated every day.

Facebook had 1.28 billion daily active users in March 2017, sharing close to four million pieces of unstructured information in the form of text, images, URLs, news, and videos (Source: Facebook). 1.3 billion Twitter users share approximately 500 million tweets a day (Source: Twitter). The number of Internet of Things (IoT) sensors, embedded in lights, thermostats, cars, watches, smart devices, and so on, is estimated to grow from 50 billion to 200 billion by 2020 (Source: IDC estimates). YouTube users upload 300 hours of new video content every five minutes. Netflix has 30 million viewers who stream 77,000 hours of video daily. Amazon has sold approximately 480 million products and has approximately 244 million customers. In the financial sector, the volume of transactional data generated by even a single large institution is enormous: approximately 25 million households in the US have Bank of America, a major financial institution, as their primary bank, and together they produce petabytes of data annually. Overall, it is estimated that the global Big Data industry will be worth 43 billion US dollars in 2017 (Source: www.statista.com).

Each of the aforementioned companies, and many more like them, faces the very real problem of storing all of this structured and unstructured data, processing it, and learning hidden patterns from it to increase revenue and improve customer satisfaction. We will explore how current methods, tools, and technologies can help us learn from data in Big Data-scale environments, and the challenges unique to this problem space that we, as practitioners in the field, must recognize.

This chapter has the following structure:

  • What are the characteristics of Big Data?
  • Big Data Machine Learning
    • General Big Data Framework:
      • Big Data cluster deployment frameworks
      • HortonWorks Data Platform (HDP)
      • Cloudera CDH
      • Amazon Elastic MapReduce (EMR)
      • Microsoft HDInsight
    • Data acquisition:
      • Publish-subscribe framework
      • Source-sink framework
      • SQL framework
      • Message queueing framework
      • Custom framework
    • Data storage:
      • Hadoop Distributed File System (HDFS)
      • NoSQL
    • Data processing and preparation:
      • Hive and Hive Query Language (HQL)
      • Spark SQL
      • Amazon Redshift
      • Real-time stream processing
    • Machine Learning
    • Visualization and analysis
  • Batch Big Data Machine Learning
    • H2O:
      • H2O architecture
      • Machine learning in H2O
      • Tools and usage
      • Case study
      • Business problems
      • Machine Learning mapping
      • Data collection
      • Data sampling and transformation
      • Experiments, results, and analysis
    • Spark MLlib:
      • Spark architecture
      • Machine Learning in MLlib
      • Tools and usage
      • Experiments, results, and analysis
  • Real-time Big Data Machine Learning
    • Scalable Advanced Massive Online Analysis (SAMOA):
      • SAMOA architecture
      • Machine Learning algorithms
      • Tools and usage
      • Experiments, results, and analysis
  • The future of Machine Learning

What are the characteristics of Big Data?

Big Data differs from conventional data in several important ways. Here we highlight the four Vs that characterize Big Data. Each of them makes it necessary to use specialized tools, frameworks, and algorithms for data acquisition, storage, processing, and analytics:

  • Volume: One of the characteristics of Big Data is the size of the content, structured or unstructured, which exceeds the storage capacity or processing power available on a single machine and therefore must be distributed across multiple machines.
  • Velocity: Another characteristic of Big Data is the rate at which the content is generated, which contributes to volume but must also be handled in a time-sensitive manner. Social media content and IoT sensor data are the best examples of high-velocity Big Data.
  • Variety: This generally refers to the multiple formats in which data exists, that is, structured, semi-structured, and unstructured, each of which takes many different forms. Social media content, with its images, video, audio, text, and structured information about activities, backgrounds, networks, and so on, is the best example of data from various sources that must be analyzed together.
  • Veracity: This refers to a wide variety of factors such as noise, uncertainty, bias, and abnormality in the data that must be addressed, especially given the volume, velocity, and variety of the data. One of the key steps, as we will discuss in the context of Big Data Machine Learning, is processing and cleaning such "unclean" data, as illustrated in the sketch following this list.

Many have added other characteristics such as value, validity, and volatility to the preceding list, but we believe they are largely derived from the previous four.
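
To make the notion of veracity concrete, the following is a minimal, framework-agnostic sketch in Java of the kind of cleaning step a Big Data pipeline applies before any learning takes place. The SensorReading class, its field names, and the valid-range bounds are hypothetical, chosen purely for illustration; in a real deployment, an equivalent filter would run inside a distributed engine such as Spark, which we cover later in this chapter.

    import java.util.Arrays;
    import java.util.List;

    // Hypothetical illustration of a veracity check: filter out noisy,
    // incomplete, or corrupt sensor readings before analysis.
    public class VeracityFilter {

        static class SensorReading {
            final String deviceId;      // source device identifier
            final long timestampMillis; // epoch time of the reading
            final double value;         // measured value, e.g. a temperature

            SensorReading(String deviceId, long timestampMillis, double value) {
                this.deviceId = deviceId;
                this.timestampMillis = timestampMillis;
                this.value = value;
            }
        }

        // Accept only structurally complete readings with plausible values.
        // The [-50, 150] range is an assumed sensor specification.
        static boolean isValid(SensorReading r) {
            return r != null
                    && r.deviceId != null && !r.deviceId.isEmpty()
                    && r.timestampMillis > 0
                    && !Double.isNaN(r.value)
                    && r.value >= -50.0 && r.value <= 150.0;
        }

        public static void main(String[] args) {
            List<SensorReading> raw = Arrays.asList(
                    new SensorReading("t-001", 1490000000000L, 21.5),       // clean
                    new SensorReading("t-002", 1490000000000L, Double.NaN), // noisy value
                    new SensorReading("", 1490000000000L, 19.0),            // missing id
                    null,                                                   // corrupt record
                    new SensorReading("t-003", 1490000000000L, 999.0)       // out of range
            );

            long kept = raw.stream().filter(VeracityFilter::isValid).count();
            System.out.println("Kept " + kept + " of " + raw.size() + " readings");
        }
    }

The same predicate style carries over directly to distributed settings: in Spark, for example, such a check becomes the argument to a filter transformation applied in parallel across the partitions of a dataset.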
