Summary

In this chapter, we have seen why datasets should always be thoroughly understood before too much exploration work is undertaken. We have discussed the details of structured data and dimensional modeling, particularly with respect to how this applies to the GDELT dataset, and have expanded the GKG model to show its underlying complexity.

We have explained the difference between the traditional ETL and newer schema-on-read ELT techniques, and have touched upon some of the issues that data engineers face regarding data storage, compression, and data formats - specifically the advantages and implementations of Avro and Parquet. We have also demonstrated that there are several ways to explore data using the various Spark API, including examples of how to use SQL on the Spark shell.

We can conclude this chapter by mentioning that the code in our repository pulls everything together and is a full model for reading in raw GKG files (use the Apache NiFi GDELT data ingest pipeline from Chapter 1, Data Acquisition if you require some data).

In the next chapter, we will dive deeper into the GKG model by exploring the techniques used to explore and analyze data at scale. We will see how to develop and enrich our GKG data model using SQL, and investigate how Apache Zeppelin notebooks can provide a richer data science experience.

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
18.226.34.197