Chapter 1.  The Big Data Science Ecosystem

As a data scientist, you are no doubt familiar with handling files and processing data, perhaps in large amounts. However, as I'm sure you will agree, anything more than a simple analysis over a single type of data requires a method of organizing and cataloguing that data so that it can be managed effectively. Indeed, this is the cornerstone of a great data scientist. As data volume and complexity increase, a consistent and robust approach can be the difference between generalized success and over-fitted failure!

This chapter is an introduction to an approach and ecosystem for achieving success with data at scale. It focuses on the data science tools and technologies. It introduces the environment, and how to configure it appropriately, but also explains some of the nonfunctional considerations relevant to the overall data architecture. While there is little actual data science at this stage, it provides the essential platform to pave the way for success in the rest of the book.

In this chapter, we will cover the following topics:

  • Data management responsibilities
  • Data architecture
  • Companion tools

Introducing the Big Data ecosystem

Data management is of particular importance when the data is in flux, that is, either constantly changing or being routinely produced and updated. What is needed in these cases is a way of storing, structuring, and auditing data that allows for the continuous processing and refinement of models and results.

Here, we describe how best to hold and organize your data so that it integrates with Apache Spark and related tools, within the context of a data architecture that is broad enough to fit everyday requirements.

Data management

Even if, in the medium term, you only intend to play around with a bit of data at home, without proper data management your efforts will, more often than not, escalate to the point where it is easy to lose track of where you are, and mistakes will happen. Taking the time to think about the organization of your data, and in particular its ingestion, is crucial. There's nothing worse than waiting for a long-running analytic to complete, collating the results, and producing a report, only to discover that you used the wrong version of the data, that the data is incomplete or has missing fields, or, even worse, that you deleted your results!

The bad news is that, despite its importance, data management is an area that is consistently overlooked in both commercial and non-commercial ventures, with precious few off-the-shelf solutions available. The good news is that it is much easier to do great data science using the fundamental building blocks that this chapter describes.

Data management responsibilities

When we think about data, it is easy to overlook the true extent of what we need to consider. Indeed, most data "newbies" think about the scope in this way:

  1. Obtain data
  2. Place the data somewhere (anywhere)
  3. Use the data
  4. Throw the data away

In reality, there are a large number of other considerations, and it is our combined responsibility to determine which ones apply to a given piece of work. The following data management building blocks assist in answering or tracking some important questions about the data:

  • File integrity
    • Is the data file complete?
    • How do you know?
    • Was it part of a set?
    • Is the data file correct?
    • Was it tampered with in transit?

  • Data integrity
    • Is the data as expected?
    • Are all of the fields present?
    • Is there sufficient metadata?
    • Is the data quality sufficient?
    • Has there been any data drift?

  • Scheduling
    • Is the data routinely transmitted?
    • How often does the data arrive?
    • Was the data received on time?
    • Can you prove when the data was received?
    • Does it require acknowledgement?

  • Schema management
    • Is the data structured or unstructured?
    • How should the data be interpreted?
    • Can the schema be inferred?
    • Has the data changed over time?
    • Can the schema be evolved from the previous version?

  • Version management
    • What is the version of the data?
    • Is the version correct?
    • How do you handle different versions of the data?
    • How do you know which version you're using?

  • Security
    • Is the data sensitive?
    • Does it contain personally identifiable information (PII)?
    • Does it contain personal health information (PHI)?
    • Does it contain payment card data covered by Payment Card Industry (PCI) standards?
    • How should I protect the data?
    • Who is entitled to read/write the data?
    • Does it require anonymization/sanitization/obfuscation/encryption?

  • Disposal
    • How do we dispose of the data?
    • When do we dispose of the data?
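
Several of the questions above can be answered mechanically at ingestion time. The following is a minimal Python sketch, not a method prescribed by this chapter, covering two of them: whether a file arrived unaltered (file integrity) and whether all expected fields are present (data integrity). It assumes the sender publishes a SHA-256 digest alongside each file and that the data is CSV with a header row. The function names, the sample payload, and the required field names are hypothetical.

```python
import csv
import hashlib
import io


def sha256_of(data: bytes) -> str:
    """Return the hex SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def file_is_intact(data: bytes, expected_digest: str) -> bool:
    """File integrity: does the payload match the digest the sender published?"""
    return sha256_of(data) == expected_digest


def missing_fields(data: bytes, required: set) -> set:
    """Data integrity: which required columns are absent from the CSV header?"""
    reader = csv.DictReader(io.StringIO(data.decode("utf-8")))
    return set(required) - set(reader.fieldnames or [])


# Illustrative payload standing in for a received file (assumption).
payload = b"id,name,amount\n1,alice,10.5\n2,bob,7.25\n"
# In practice the digest would be supplied by the sender, not computed locally.
published_digest = sha256_of(payload)

print(file_is_intact(payload, published_digest))                  # True
print(file_is_intact(payload + b"3,eve,0\n", published_digest))   # False
print(missing_fields(payload, {"id", "name", "currency"}))        # {'currency'}
```

Checks like these are cheap to run before any long-running analytic starts, which is exactly when discovering an incomplete or altered file costs the least.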

If, after all that, you are still not convinced, then before you go ahead and write that bash script using the gawk and crontab commands, keep reading. You will soon see that there is a far quicker, more flexible, and safer method that allows you to start small and incrementally create commercial-grade ingestion pipelines!

The right tool for the job

Apache Spark is the emerging de facto standard for scalable data processing. At the time of writing this book, it is the most active Apache Software Foundation (ASF) project and has a rich variety of companion tools available. There are new projects appearing every day, many of which overlap in functionality, so it takes time to learn what they do and decide whether they are appropriate to use. Unfortunately, there's no quick way around this. Usually, specific trade-offs must be made on a case-by-case basis; there is rarely a one-size-fits-all solution. Therefore, the reader is encouraged to explore the available tools and choose wisely!

Various technologies are introduced throughout this book, and the hope is that they give the reader a taste of some of the more useful and practical ones, to a level where they may start utilizing them in their own projects. Further, we hope to show that, if the code is written carefully, technologies may be interchanged through clever use of Application Programming Interfaces (APIs) (or higher-order functions in Spark Scala), even when a decision proves to be incorrect.
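
To illustrate the idea of interchangeable technologies, here is a small sketch in plain Python (rather than Spark Scala) of how a higher-order function can keep a storage decision swappable. The pipeline depends only on a "sink" function, so replacing one backend with another does not touch the ingestion logic. All names here (`ingest`, `make_memory_sink`, `discard_sink`) are hypothetical and stand in for whatever real technologies a project might choose.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]
# A sink is any function that consumes records and reports how many it handled.
Sink = Callable[[Iterable[Record]], int]


def make_memory_sink(store: List[Record]) -> Sink:
    """Build a sink that appends records to an in-memory list."""
    def write(records: Iterable[Record]) -> int:
        count = 0
        for record in records:
            store.append(record)
            count += 1
        return count
    return write


def discard_sink(records: Iterable[Record]) -> int:
    """An alternative sink that only counts records, e.g. for a dry run."""
    return sum(1 for _ in records)


def ingest(lines: Iterable[str], sink: Sink) -> int:
    """Parse non-empty lines into records and hand them to whichever sink is supplied."""
    records = ({"raw": line.strip()} for line in lines if line.strip())
    return sink(records)


lines = ["first\n", "\n", "second\n"]
store: List[Record] = []
print(ingest(lines, make_memory_sink(store)))  # 2
print(ingest(lines, discard_sink))             # 2
print(store)  # [{'raw': 'first'}, {'raw': 'second'}]
```

Because `ingest` never names a concrete backend, a decision that later proves to be incorrect can be reversed by writing one new sink function.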
