Preface

The purpose of data science is to transform the world using data, a goal that is mainly achieved by disrupting and changing real processes in real industries. To operate at that level, we need to be able to build data science solutions of substance: ones that solve real problems and that run reliably enough for people to trust and act upon.

This book explains how to use Spark to deliver production-grade data science solutions that are innovative, disruptive, and reliable enough to be trusted. In writing this book, the authors set out to deliver a work that transcends the traditional cookbook style: one that provides not just code examples, but also develops the techniques and mindset needed to explore content like a master; after all, as they say, content is king! Readers will notice that the book places a heavy emphasis on news analytics, occasionally pulling in other datasets such as tweets and financial data. This emphasis on news is no accident; considerable effort has gone into focusing on datasets that offer context at a global scale.

The implicit problem this book addresses is the lack of data offering proper context around how and why people make decisions. Directly accessible data sources are often narrowly focused on problem specifics and, as a consequence, light on the broader behavioral context needed to really understand what drives the decisions people make.

Consider a simple example in which website users' key information, such as age, gender, location, shopping behavior, and purchases, is known: we might use this data to recommend products based on what others "like them" have been buying.

But to be exceptional, we need more context as to why people behave as they do. When news reports suggest that a massive Atlantic hurricane is approaching the Florida coastline and could make landfall in, say, 36 hours, perhaps we should instead be recommending products people might need: items such as USB battery packs for keeping phones charged, candles, flashlights, water purifiers, and the like. By understanding the context in which decisions are made, we can conduct better science.

Therefore, whilst this book certainly contains useful code and, in many cases, unique implementations, it also dives deep into the techniques and skills required to truly master data science, some of which are often overlooked or not considered at all. Drawing on many years of commercial experience, the authors have leveraged their extensive knowledge to bring the real and exciting world of data science to life.

What this book covers

Chapter 1, The Big Data Science Ecosystem, introduces an approach, and its accompanying ecosystem, for achieving success with data at scale. It focuses on the data science tools and technologies used in later chapters, introduces the environment, and explains how to configure it appropriately. It also covers some of the non-functional considerations relevant to the overall data architecture and its long-term success.

Chapter 2, Data Acquisition, explains that, as a data scientist, one of your most important tasks is to accurately load data into a data science platform. Rather than relying on uncontrolled, ad hoc processes, this chapter shows how to construct a general data ingestion pipeline in Spark that serves as a reusable component across many feeds of input data.
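
As a flavor of what such a reusable component might look like, here is a minimal sketch in Scala, assuming a tab-separated feed; the ingest function, the paths, and the provenance columns are hypothetical illustrations rather than the chapter's actual pipeline.

    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.{current_timestamp, input_file_name, lit}

    // Hypothetical reusable ingestion step: every feed passes through the same
    // function, which reads the raw files, stamps provenance columns, and
    // returns a DataFrame ready to be written to a landing area.
    def ingest(spark: SparkSession, inputPath: String, feedName: String): DataFrame =
      spark.read
        .option("sep", "\t")                       // tab-separated source files
        .csv(inputPath)
        .withColumn("feed", lit(feedName))         // which feed produced the row
        .withColumn("sourceFile", input_file_name())
        .withColumn("ingestedAt", current_timestamp())

    // The same component then serves any number of feeds, for example:
    // ingest(spark, "/landing/gdelt", "gdelt").write.mode("append").parquet("/data/raw/gdelt")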

Chapter 3, Input Formats and Schema, demonstrates how to load data from its raw format into different schemas, thereby enabling a variety of downstream analytics to be run over the same data. With this in mind, we will look at the traditionally well-understood area of data schemas. We will cover key areas of traditional database modeling and explain how some of these cornerstone principles are still applicable to Spark today. In addition, whilst honing our Spark skills, we will analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.
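
A minimal sketch of the schema-on-read idea, using a tiny, hypothetical slice of the GDELT event model (the field names are real GDELT columns; the paths are illustrative):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.types._

    // Declaring the schema up front gives typed columns and avoids a costly
    // inference pass over the raw files.
    val eventSchema = StructType(Seq(
      StructField("GLOBALEVENTID",  LongType,    nullable = false),
      StructField("SQLDATE",        IntegerType, nullable = true),
      StructField("EventCode",      StringType,  nullable = true),
      StructField("GoldsteinScale", DoubleType,  nullable = true)
    ))

    val spark = SparkSession.builder.appName("schema-demo").getOrCreate()
    val events = spark.read
      .schema(eventSchema)
      .option("sep", "\t")
      .csv("/data/raw/gdelt")   // hypothetical landing path

    // The same raw files can now back different downstream views, for
    // example a compact columnar copy partitioned by date.
    events.write.partitionBy("SQLDATE").parquet("/data/gdelt/events")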

Chapter 4, Exploratory Data Analysis, tackles the common misconception that an EDA is only for discovering the statistical properties of a dataset and providing insights about how it can be exploited. In practice, this isn't the full story. A full EDA extends that idea to include a detailed assessment of the "feasibility of using this data feed in production." It requires us to also understand how we would specify a production-grade data loading routine for this dataset, one that might potentially run in a "lights-out mode" for many years. This chapter offers a rapid method for data quality assessment, using a "data profiling" technique to accelerate the process.
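
To illustrate the data profiling idea, the sketch below computes, for every column of a DataFrame, the number of missing values and an approximate distinct count in a single pass; in practice the checks and thresholds would come from the feed's specification, so treat this as an outline rather than the chapter's full method.

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{approx_count_distinct, col, count, lit, when}

    // Profile a DataFrame: total rows, plus per-column null and
    // (approximate) distinct counts.
    def profile(df: DataFrame): DataFrame = {
      val perColumn = df.columns.flatMap { c =>
        Seq(
          count(when(col(c).isNull, 1)).alias(s"${c}_nulls"),
          approx_count_distinct(col(c)).alias(s"${c}_distinct")
        )
      }
      df.agg(count(lit(1)).alias("rows"), perColumn: _*)
    }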

Chapter 5, Spark for Geographic Analysis, demonstrates how to get started with geographic processing, a powerful new use case for Spark. The aim of this chapter is to explain how data scientists can process geographic data using Spark to produce powerful map-based views of very large datasets. We demonstrate how to process spatio-temporal datasets easily via Spark's integration with GeoMesa, which helps turn Spark into a sophisticated geographic processing engine. The chapter then leverages this spatio-temporal data to apply machine learning with a view to predicting oil prices.
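
GeoMesa supplies the spatial indexing and geometry types themselves; purely to illustrate the kind of spatio-temporal question involved, the following sketch applies a plain bounding-box and date filter over real GDELT column names (the path and coordinates are illustrative, and no GeoMesa API is shown):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("geo-demo").getOrCreate()
    val events = spark.read.parquet("/data/gdelt/events")  // hypothetical path

    // Restrict events to a rough bounding box around Florida, from 2016 onwards.
    val florida = events.filter(
      """ActionGeo_Lat  BETWEEN 24.0 AND 31.0 AND
         ActionGeo_Long BETWEEN -88.0 AND -79.0 AND
         SQLDATE >= 20160101"""
    )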

Chapter 6, Scraping Link-Based External Data, explains a common pattern for enhancing local data with external content found at URLs or over APIs, such as GDELT and Twitter. We offer a tutorial using the GDELT news index service as a source of news URLs, demonstrating how to build a web-scale news scanner that scrapes global breaking news of interest from the internet. We further explain how to use the specialist web-scraping component in a way that overcomes the challenges of scale.
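
As a taste of the pattern, here is a minimal sketch that spreads URL fetches across Spark partitions, isolating per-URL failures so one bad link cannot fail the whole job; a production scraper, as covered in the chapter, also needs politeness delays, retries, and boilerplate removal, and the input path here is hypothetical.

    import org.apache.spark.sql.SparkSession
    import scala.io.Source
    import scala.util.Try

    val spark = SparkSession.builder.appName("scrape-demo").getOrCreate()
    val urls = spark.sparkContext.textFile("/data/news/urls.txt")  // one URL per line

    // Fetch each page; Try confines any network failure to that single URL.
    val pages = urls.repartition(100).map { url =>
      val body = Try {
        val src = Source.fromURL(url, "UTF-8")
        try src.mkString finally src.close()
      }.getOrElse("")
      (url, body)
    }.filter { case (_, body) => body.nonEmpty }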

Chapter 7, Building Communities, addresses a common use case in data science and big data. With more and more people interacting, communicating, exchanging information, or simply sharing a common interest in different topics, the entire world can be represented as a graph. A data scientist must be able to detect communities, find influencers/top contributors, and detect possible anomalies.
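
This is not the chapter's full method, but the following sketch shows how naturally Spark expresses graph work: a tiny, made-up "who interacts with whom" graph is split into candidate communities using GraphX's built-in connected components.

    import org.apache.spark.graphx.{Edge, Graph}
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("graph-demo").getOrCreate()
    val sc = spark.sparkContext

    // Four users; dave (4) interacts with nobody and forms his own component.
    val vertices = sc.parallelize(Seq((1L, "alice"), (2L, "bob"), (3L, "carol"), (4L, "dave")))
    val edges    = sc.parallelize(Seq(Edge(1L, 2L, 1), Edge(2L, 3L, 1)))

    val graph = Graph(vertices, edges)
    val communities = graph.connectedComponents().vertices  // (vertexId, communityId)
    communities.collect().foreach(println)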

Chapter 8, Building a Recommendation System, observes that if one were to choose an algorithm to showcase data science to the public, a recommendation system would certainly be in the frame. Today, recommendation systems are everywhere; their popularity is down to their versatility, usefulness, and broad applicability. In this chapter, we will demonstrate how to recommend music content using raw audio signals.
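
The chapter itself works from raw audio signals; purely as a familiar point of contrast, the sketch below shows the classic collaborative-filtering baseline using Spark ML's ALS over hypothetical (user, song, playCount) data. This is explicitly not the technique used in the chapter.

    import org.apache.spark.ml.recommendation.ALS
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("als-demo").getOrCreate()
    import spark.implicits._

    // Made-up implicit feedback: how often each user played each song.
    val plays = Seq((0, 10, 5.0), (0, 11, 1.0), (1, 10, 4.0), (1, 12, 2.0))
      .toDF("userId", "songId", "playCount")

    val model = new ALS()
      .setUserCol("userId").setItemCol("songId").setRatingCol("playCount")
      .setImplicitPrefs(true)   // play counts are implicit, not explicit ratings
      .fit(plays)

    model.recommendForAllUsers(3).show(false)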

Chapter 9, News Dictionary and Real-Time Tagging System, notes that while a hierarchical data warehouse stores data in files within folders, a typical Hadoop-based system relies on a flat architecture to store your data. Without proper data governance, or a clear understanding of what your data is all about, there is an undeniable risk of turning data lakes into swamps, where an interesting dataset such as GDELT becomes nothing more than a folder containing a vast number of unstructured text files. In this chapter, we describe an innovative way of labeling incoming GDELT data, in an unsupervised way and in near real time.

Chapter 10, Story De-duplication and Mutation, de-duplicates and indexes the GDELT database into stories, before tracking stories over time to understand the links between them, how they may mutate, and whether they could lead to any subsequent events in the near future. Core to this chapter are the concepts of Simhash, used to detect near duplicates, and Random Indexing, used to build vectors of reduced dimensionality.
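
The essence of Simhash fits in a few lines. The sketch below, a simplified illustration rather than the chapter's implementation (the 64-bit mixing constant is an arbitrary choice), hashes each token, accumulates plus or minus one per bit position, and keeps the sign bits; near-duplicate texts then differ in only a few bits, that is, they have a small Hamming distance.

    // Compute a 64-bit Simhash over a token sequence.
    def simhash(tokens: Seq[String]): Long = {
      val counts = Array.fill(64)(0)
      for (t <- tokens) {
        val h = t.##.toLong * 0x9E3779B97F4A7C15L  // cheap 64-bit mix of the token hash
        for (i <- 0 until 64)
          counts(i) += (if (((h >>> i) & 1L) == 1L) 1 else -1)
      }
      (0 until 64).foldLeft(0L)((acc, i) => if (counts(i) > 0) acc | (1L << i) else acc)
    }

    // Near duplicates have small Hamming distance between their Simhashes.
    def hammingDistance(a: Long, b: Long): Int = java.lang.Long.bitCount(a ^ b)

    // hammingDistance(simhash("quick brown fox".split(" ")),
    //                 simhash("quick brown cat".split(" ")))   // small value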

Chapter 11, Anomaly Detection and Sentiment Analysis, recalls that perhaps the most notable occurrence of 2016 was the tense US presidential election and its eventual outcome: the election of President Donald Trump. The campaign will long be remembered, not least for its unprecedented use of social media and its stirring up of passion among users, most of whom made their feelings known through hashtags. In this chapter, instead of trying to predict the outcome itself, we aim to detect abnormal tweets during the US election using a real-time Twitter feed.

Chapter 12, TrendCalculus, explains that long before "what's trending" became a popular topic of study among data scientists, there was an older concept that is still not well served by data science: that of trends. Presently, the analysis of trends, if it can be called that, is primarily carried out by people "eyeballing" time series charts and offering interpretations. But what is it that people's eyes are actually doing? This chapter describes an Apache Spark implementation of a new algorithm for studying trends numerically: TrendCalculus.

Chapter 13, Secure Data, acknowledges that throughout this book we visit many areas of data science, often straying into those not traditionally associated with a data scientist's core working knowledge. In this chapter, we visit another of those often-overlooked fields: secure data; more specifically, how to protect your data and analytic results at all stages of the data life cycle. Core to this chapter is the construction of a commercial-grade encryption codec for Spark.
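
The codec itself is the subject of the chapter; as background, the sketch below shows only the cryptographic primitive such a codec can be built around, authenticated AES-GCM encryption via the standard javax.crypto API (key sizes and parameters here are illustrative choices).

    import javax.crypto.{Cipher, KeyGenerator}
    import javax.crypto.spec.GCMParameterSpec

    // Generate a random 256-bit AES key (in practice this comes from a key store).
    val key = {
      val kg = KeyGenerator.getInstance("AES")
      kg.init(256)
      kg.generateKey()
    }

    // A fresh 96-bit IV per message; GCM also authenticates the ciphertext.
    val iv = new Array[Byte](12)
    new java.security.SecureRandom().nextBytes(iv)

    val cipher = Cipher.getInstance("AES/GCM/NoPadding")
    cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv))
    val cipherText = cipher.doFinal("sensitive analytic result".getBytes("UTF-8"))
    // Persist the IV alongside the ciphertext; both (plus the key) are needed to decrypt.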

Chapter 14, Scalable Algorithms, explains why even basic algorithms that work at small scale will often fail on big data. We'll see how to avoid issues when writing Spark jobs that run over massive datasets, and we'll learn about the structure of algorithms and how to write custom data science analytics that scale over petabytes of data. The chapter covers areas such as parallelization strategies, caching, shuffle strategies, garbage collection optimization, and probabilistic models, explaining how these can help you get the most out of the Spark paradigm.
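
Two of these themes in miniature, as a sketch over a hypothetical input path: reduceByKey combines values map-side before the shuffle, whereas groupByKey ships every raw value across the network and can run out of memory at scale; and caching persists a dataset that is reused instead of recomputing it for every action.

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("scale-demo").getOrCreate()
    val pairs = spark.sparkContext
      .textFile("/data/raw/gdelt")              // hypothetical input
      .map(line => (line.split("\t")(0), 1L))   // key on the first field
      .cache()                                  // reused by both actions below

    val good = pairs.reduceByKey(_ + _)              // scales: map-side combine
    val bad  = pairs.groupByKey().mapValues(_.size)  // avoid on skewed big data

    println(s"keys via reduceByKey: ${good.count()}")
    println(s"keys via groupByKey:  ${bad.count()}")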
