Chapter 3. Input Formats and Schema

The aim of this chapter is to demonstrate how to load data from its raw format into different schemas, thereby enabling a variety of downstream analytics to be run over the same data. When writing analytics, or better still, building libraries of reusable software, you generally have to work with interfaces of fixed input types. Having the flexibility to move data between schemas, depending on the purpose, can therefore deliver considerable downstream value, both by widening the types of analysis possible and by enabling the reuse of existing code.

Our primary objective is to learn about the data format features that ship with Spark, although we will also delve into the finer points of data management, introducing proven methods that will improve your data handling and increase your productivity. After all, you will most likely be required to formalize your work at some point, and knowing how to avoid the potential long-term pitfalls is invaluable, both while writing analytics and long afterwards.

With this in mind, we will use this chapter to look at the traditionally well-understood area of data schemas. We will cover key areas of traditional database modeling and explain how some of these cornerstone principles are still applicable to Spark.

In addition, while honing our Spark skills, we will analyze the GDELT data model and show how to store this large dataset in an efficient and scalable manner.

We will cover the following topics:

  • Dimensional modeling: benefits and weaknesses in relation to Spark
  • Focus on the GDELT model
  • Lifting the lid on schema-on-read
  • Avro object model
  • Parquet storage model

Let's start with some best practice.

A structured life is a good life

When learning about the benefits of Spark and big data, you may have heard discussions about structured data versus semi-structured data versus unstructured data. While Spark promotes the use of all three, it also provides the basis for their consistent treatment; the only constraint is that the data should be record-based. Provided they are record-based, datasets can be transformed, enriched, and manipulated in the same way, regardless of their organization, as the short sketch below illustrates.
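
To make this concrete, the following is a minimal sketch (the file paths and their contents are hypothetical) showing how a structured CSV file, a semi-structured JSON file, and an unstructured text file each load into the same record-based abstraction, after which the same API applies to all of them:

import org.apache.spark.sql.SparkSession

// Hypothetical local session and file paths, for illustration only
val spark = SparkSession.builder()
  .appName("record-based-inputs")
  .master("local[*]")
  .getOrCreate()

// Structured: a CSV file with a header row becomes a DataFrame with named columns
val csvDF  = spark.read.option("header", "true").csv("/tmp/events.csv")

// Semi-structured: JSON, with the schema inferred from the records themselves
val jsonDF = spark.read.json("/tmp/events.json")

// Unstructured: plain text, one record per line in a single 'value' column
val textDF = spark.read.text("/tmp/events.txt")

// Regardless of origin, each is now simply a collection of records and can be
// transformed, enriched, and manipulated with exactly the same API
Seq(csvDF, jsonDF, textDF).foreach { df =>
  println(s"records: ${df.count()}, columns: ${df.columns.mkString(", ")}")
}

Once loaded in this way, downstream code no longer needs to know whether a record originated from a spreadsheet export, an API response, or a raw log file.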

However, it is worth noting that having unstructured data does not necessitate taking an unstructured approach. Having identified techniques for exploring datasets in the previous chapter, it would be tempting to dive straight into stashing the data somewhere accessible and immediately commencing simple profiling analytics. In real-life situations, this activity often takes precedence over due diligence. Once again, we would encourage you to consider several key areas of interest, for example file integrity, data quality, schedule management, version management, security, and so on, before embarking on this exploration. These should not be ignored, and many are large topics in their own right.

Therefore, while we have already covered many of these concerns in Chapter 2, Data Acquisition, and will study more later, for example in Chapter 13, Secure Data, in this chapter we are going to focus specifically on data input and output formats, exploring some of the methods we can employ to ensure better data handling and management.
