Keep your Dataproc clusters stateless

Remember that Hadoop in its pure, non-cloud form maintains state in a distributed file system called HDFS. HDFS lives on the same set of nodes where the Hadoop jobs actually run; for this reason, Hadoop is said not to separate compute and storage. The compute (Hadoop JARs) and storage (HDFS data) are on the same machines, and the JARs are actually shipped to where the data resides.

This was a fine pattern in the pre-cloud days, but in the cloud world, if you kept your data in HDFS, you would run up an enormous bill. Why? Because in the world of elastic Hadoop clusters, such as Dataproc on GCP or Elastic MapReduce on AWS, HDFS lives on the persistent disks of the cloud VMs in the cluster. If you keep data in HDFS, you need those disks to always exist; therefore, the cluster must always be up. You will pay a lot, use only a little, and basically negate the whole point of moving to the cloud.

So, what you really ought to do is move your data from on-premises HDFS to cloud GCS. Do not move from on-premises HDFS to cloud HDFS. That way, you can spin up clusters whenever you like, point them at data in GCS buckets, run your job, and kill the cluster, as sketched below. Such clusters are called stateless because they only reference state data from an external source (GCS buckets) rather than maintaining it internally in HDFS.
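As a minimal sketch of this create/run/delete lifecycle, the gcloud commands look roughly like the following. The cluster name, region, bucket paths, and job class here are hypothetical placeholders, not values from this book:

    # Spin up an ephemeral cluster (name, region, and size are placeholders).
    gcloud dataproc clusters create ephemeral-etl \
        --region=us-central1 \
        --num-workers=2

    # Submit a Spark job that reads from and writes to GCS, not HDFS.
    # Everything after the bare "--" is passed as arguments to the main class.
    gcloud dataproc jobs submit spark \
        --cluster=ephemeral-etl \
        --region=us-central1 \
        --class=com.example.WordCount \
        --jars=gs://my-bucket/jars/wordcount.jar \
        -- gs://my-bucket/input/ gs://my-bucket/output/

    # Delete the cluster; the data remains safely in GCS.
    gcloud dataproc clusters delete ephemeral-etl \
        --region=us-central1 --quiet

Because the input and output live in GCS, deleting the cluster loses nothing: the next cluster you create can pick up exactly where this one left off.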
