Summary

Apache Spark in the cloud is the perfect solution for data scientists and data engineers who want to concentrate on getting the actual work done without being concerned about the operation of an Apache Spark cluster.

We saw that Apache Spark in the cloud is much more than just installing Apache Spark on a couple of virtual machines. It comes as a whole package for the data scientist, completely based on open-source components, which makes it easy to migrate to other cloud providers or to local datacenters if necessary.

We also learned that a typical data science project draws on a huge variety of skills. Apache Spark in the cloud accommodates this by supporting all common programming languages and open-source data analytics frameworks on top of Apache Spark and Jupyter notebooks, and by eliminating the need for the operational skills required to maintain an Apache Spark cluster.

Sometimes one more level of access to the underlying infrastructure is necessary. Perhaps specific versions or patch levels of software are needed, or very specific configuration settings must be used. Normally this would be the case for going back to IaaS and installing everything on our own. That is not really necessary, though: fortunately, there is a better solution that still provides a fair amount of automation and still pushes the hard parts of operations to the cloud provider.

This and more is explained in the next chapter, Apache Spark on Kubernetes.
