We will use the same dataset as before, that is, 2 million customer records, addresses, and contacts.
But before we proceed, let's clean up the data created in previous chapters by following the steps explained here. Ensure that the processes required for the cleanup are up and running, namely Hue, DFS, HiveServer2, ZooKeeper, and Kafka.
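One quick way to confirm that the required processes are reachable before starting the cleanup is to probe their TCP ports. The following is a minimal sketch, not part of the original setup: the port numbers are commonly used defaults and are assumptions here (8888 for Hue, 8020 for the DFS NameNode, 10000 for HiveServer2, 2181 for ZooKeeper, 9092 for Kafka), and the names `SERVICES`, `is_listening`, and `check_services` are hypothetical helpers introduced only for illustration. Adjust the host and ports to match your installation.

```python
import socket

# Assumed default ports; these are NOT taken from the chapter's setup
# and may differ in your installation.
SERVICES = {
    "Hue": 8888,
    "DFS NameNode": 8020,
    "HiveServer2": 10000,
    "ZooKeeper": 2181,
    "Kafka": 9092,
}


def is_listening(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and unreachable hosts.
        return False


def check_services(host="localhost", services=SERVICES):
    """Map each service name to True (port open) or False (port closed)."""
    return {name: is_listening(host, port) for name, port in services.items()}


if __name__ == "__main__":
    for name, up in check_services().items():
        print(f"{name:15s} {'up' if up else 'DOWN'}")
```

An open port only shows that something is listening there; it does not prove the service is healthy, so treat this as a first sanity check rather than a full verification.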