In the previous chapter, we discussed the various data types, their formats, and their storage. We also covered different databases and provided an overview of them. Finally, we examined the factors and features to compare when choosing a data format, storage type, or database, so that you can solve a data engineering problem effectively.
In this chapter, we will look at the various kinds of popular platforms that are available to run data engineering solutions. You will also learn about the considerations you should make as an architect to choose one of them. To do so, we will discuss the finer details of each platform and the alternatives these platforms provide. Finally, you will learn how to make the most of these platforms to architect an efficient, robust, and cost-effective solution for a business problem.
In this chapter, we’re going to cover the following main topics:
To complete this chapter, you’ll need the following:
The code for this chapter can be found in this book’s GitHub repository: https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/tree/main/Chapter03
With the spread of information technology (IT) into all spheres of life, dependence on IT infrastructure has increased manifold. IT now runs many critical, real-time businesses, which means there can be zero or negligible downtime for maintenance or failure. Demands can also spike rapidly; for example, online shopping websites see a huge surge in traffic during the holiday season. So, IT needs to be highly available, elastic, flexible, and quick. These requirements motivated the creation of virtual platform technologies such as virtualization and containerization. For example, Barclays, a multinational financial firm based in the UK, was losing ground to competitors due to its slow pace of innovation and project delivery. One of its major roadblocks was the time it took to provision new servers, so it decided to use Red Hat OpenShift to containerize its applications. This reduced provisioning time dramatically, from weeks to hours. As a result, time to market became much faster, which helped Barclays stay ahead of its competitors.
Virtualization abstracts the hardware and allows you to run multiple operating systems on a single server or piece of hardware. It uses software to create a virtual abstraction over the hardware resources so that multiple virtual machines (VMs) can run over the physical hardware with their virtual OS, virtual CPU (vCPU), virtual storage, and virtual networking. The following diagram shows how virtualization works:
Figure 3.1 – Virtualization
As shown in the preceding diagram, VMs run on a host machine with the help of a hypervisor. A hypervisor is a piece of software or firmware that can host VMs on physical hardware such as a server or a computer. The physical machines on which hypervisors create VMs are called host machines, and the VMs are called guest machines. The operating system on the host machine is called the host OS, while the operating system in a VM is called the guest OS.
The following are the benefits of virtualization:
The following are a few examples of popular VMs:
Let’s see how a VM works. We will start this exercise by downloading and installing Oracle VirtualBox:
Then, install VirtualBox using the installation instructions at https://www.virtualbox.org/manual/ch02.html. These instructions are likely to vary by OS.
Figure 3.2 – The Oracle VirtualBox Manager home page
Figure 3.3 – Configuring the guest VM using Oracle VirtualBox
In the preceding screenshot, you can see that you need to provide a unique name for the VM. You can also select the OS type and its version, as well as configure the memory (RAM) size. Finally, you can choose to configure or add a new virtual hard disk. If you choose to add a new hard disk, then a popup similar to the following will appear:
Figure 3.4 – Creating a virtual hard disk using Oracle VirtualBox
As shown in the preceding screenshot, when configuring a virtual hard disk, you can choose from various kinds of available virtual hard disk drives. The major popular virtual hard disks are as follows:
Once you have configured your desired virtual hard disk configuration, you can create the VM by clicking the Create button.
Figure 3.5 – Guest VM created and listed in Oracle VirtualBox
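The same VM setup can also be scripted with VirtualBox's VBoxManage CLI instead of the GUI, which is handy for automation. The following is a minimal sketch; the VM name, OS type, memory, and disk sizes are illustrative placeholders, not values from this exercise:

```shell
# Create and register a new guest VM (name and OS type are placeholders)
VBoxManage createvm --name "ubuntu-guest" --ostype "Ubuntu_64" --register

# Configure memory (in MB) and the number of vCPUs
VBoxManage modifyvm "ubuntu-guest" --memory 2048 --cpus 2

# Create a 20 GB VDI virtual hard disk and attach it via a SATA controller
VBoxManage createmedium disk --filename ubuntu-guest.vdi --size 20480 --format VDI
VBoxManage storagectl "ubuntu-guest" --name "SATA" --add sata
VBoxManage storageattach "ubuntu-guest" --storagectl "SATA" \
  --port 0 --device 0 --type hdd --medium ubuntu-guest.vdi
```

Running these commands produces the same kind of guest VM you see listed in the VirtualBox Manager.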
Although VMs simplify delivery and provide a platform that is more available and faster to provision than traditional servers, they have some limitations:
Containerization can help us overcome these shortcomings. We’ll take a look at containerization in the next section.
Containerization is a technique that abstracts the OS (instead of the hardware) and lets applications run directly on top of it. Containerization is more efficient than virtualization because applications don't need a guest OS to run. Instead, all containers share the kernel of the host OS, which lets multiple isolated applications run side by side on the same machine. The following diagram shows how containerization works:
Figure 3.6 – Containerization
In containerization, a piece of software called a container engine runs on the host OS. This allows applications to run on top of the container engine, without any need to create a separate guest OS. Each running instance of the application, along with its dependencies, is called a container. Here, the application, along with its dependencies, can be bundled into a portable package called an image.
The following are the advantages of containerization over virtualization:
Docker is the most popular container engine. Let’s look at some of the most important and common terms related to Docker:
Now that we have learned the important terminology related to Docker, let's learn how to set up Docker on a local machine using Docker Desktop. We will also show you how to deploy an image and start a Docker container:
Based on your operating system, you can follow the installation instructions at https://docs.docker.com/desktop/mac/install/ (for Mac) or https://docs.docker.com/desktop/windows/install/ (for Windows).
If you don’t have Maven installed in your system, please download and install it (instructions for installing Maven can be found at https://maven.apache.org/install.html).
Figure 3.7 – Docker Desktop home page
Once you have successfully created your account, click the Sign In button and enter your Docker ID and password to log in, as shown in the following screenshot:
Figure 3.8 – Logging into Docker Desktop
Figure 3.9 – Docker build commands
To try this out, first, download the code from https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/blob/main/Chapter03/sourcecode/DockerExample. In this project, we will create a simple REST API using Spring Boot, then deploy and run the application locally in our Docker environment. The artifact that is generated when we build this project is DockerExample-1.0-SNAPSHOT.jar. The Dockerfile looks as follows:
# Each step creates a read-only layer of the image.
# For Java 8
FROM openjdk:8-jdk-alpine
# cd /opt/app
WORKDIR /opt/app
# cp target/DockerExample-1.0-SNAPSHOT.jar /opt/app/app.jar
COPY target/DockerExample-1.0-SNAPSHOT.jar app.jar
# exposing the port on which application runs
EXPOSE 8080
# java -jar /opt/app/app.jar
ENTRYPOINT ["java","-jar","/opt/app/app.jar"]
In step 1 of the Dockerfile, we pull a base image from Docker Hub. In step 2, we set the working directory inside the Docker container to /opt/app. In the next step, we copy our artifact into that working directory. After that, we expose port 8080 from the container. Finally, we execute the Java app using the java -jar command.
> mvn clean install
> docker build --tag=hello-docker:latest .
Once you run this command, you will be able to see the Docker image in Docker Desktop, in the Images tab, as shown in the following screenshot:
Figure 3.10 – Docker container created successfully and listed
Figure 3.11 – Running a Docker container
Provide a container name and a host port value and click Run in the popup dialog to start the container, as shown in the following screenshot:
Figure 3.12 – Setting up Docker Run configurations
Once you click Run, the container will be instantiated and you will be able to see the container listed (with its status set to RUNNING) in the Containers / Apps tab of Docker Desktop, as shown in the following screenshot:
Figure 3.13 – Running instance on Docker Desktop
Figure 3.14 – Testing the app deployed on Docker
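The same run-and-test cycle can be performed from a terminal, without the Docker Desktop GUI. The following is a sketch; the /hello endpoint path is an assumption, so use whatever path your Spring Boot controller actually maps:

```shell
# Run the image built earlier, mapping host port 8080 to the
# container's exposed port 8080
docker run -d --name hello-docker -p 8080:8080 hello-docker:latest

# Call the REST endpoint (the /hello path is a placeholder for
# your controller's mapping)
curl http://localhost:8080/hello

# Stop and remove the container when you are done
docker stop hello-docker
docker rm hello-docker
```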
You can log into the Docker CLI using the CLI button in Docker Desktop, as follows:
Figure 3.15 – Opening the Docker CLI from Docker Desktop
In this section, we learned about Docker. While Docker makes our lives easy and makes development and deployment considerably faster, it comes with the following set of challenges:
In a production environment, we need to solve these shortcomings if we want to run a robust, efficient, scalable, and cost-effective solution. Here, container orchestrators come to the rescue. There are many container orchestrators on the market. However, Kubernetes, which was developed and open-sourced by Google, is one of the most popular and widely used container orchestrators. In the next section, we will discuss Kubernetes in more detail.
Kubernetes is an open source container orchestrator that effectively manages containerized applications and their inter-container communication. It also automates how containers are deployed and scaled. Each Kubernetes cluster has multiple components:
The following diagram shows the various components of a Kubernetes cluster:
Figure 3.16 – A Kubernetes cluster and its components
Now, let’s briefly describe each of the components shown in the preceding diagram:
Now that we have briefly seen the components of Kubernetes and its role in containerization, let’s try to deploy the Docker image we created in the previous section in a Kubernetes cluster locally. To do that, we must install minikube (a Kubernetes cluster for running Kubernetes on your local machine):
> minikube start
A successful minikube start looks as follows:
Figure 3.17 – Starting minikube
apiVersion: v1
kind: Service
metadata:
  name: hello-docker-service
spec:
  selector:
    app: hello-docker-app
  ports:
    - protocol: "TCP"
      port: 8080
      targetPort: 8080
      nodePort: 30036
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-docker-app
spec:
  selector:
    matchLabels:
      app: hello-docker-app
  replicas: 5
  template:
    metadata:
      labels:
        app: hello-docker-app
    spec:
      containers:
        - name: hello-docker-app
          image: hello-docker-app
          imagePullPolicy: Never
          ports:
            - containerPort: 8080
The deployment.yaml file contains two types of configuration: one for the Service and one for the Deployment. Each Kubernetes component configuration mainly consists of three parts:
> eval $(minikube docker-env)
> docker build --tag hello-docker-app:latest .
> kubectl apply -f deployment.yaml
After executing this command, you should be able to see that hello-docker-service and hello-docker-app were created successfully, as shown in the following screenshot:
Figure 3.18 – Applications created in Kubernetes cluster
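You can also verify and manage the deployment from the command line. The following kubectl commands assume the deployment.yaml file shown earlier has been applied:

```shell
# List the pods created by the Deployment (there should be 5 replicas)
kubectl get pods -l app=hello-docker-app

# Inspect the Service and the Deployment
kubectl get service hello-docker-service
kubectl get deployment hello-docker-app

# Scale the Deployment up or down without editing the YAML
kubectl scale deployment hello-docker-app --replicas=3
```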
> minikube dashboard
Once executed, you should see the dashboard appear in your default browser. Here, you will be able to see the status of your deployment, as well as other monitoring information:
Figure 3.19 – The minikube dashboard
> minikube service hello-docker-service
Once you have started the service, you can check the base URL that’s been exposed by the Docker service linked to your application by executing the following command:
> minikube service --url hello-docker-service
This command will show an output similar to the following:
Figure 3.20 – Checking the base URL
Figure 3.21 – Testing app deployed using Kubernetes
In this section, we discussed how virtualization and containerization can help you manage, deploy, and develop an application in more effective, faster, and cost-optimized ways. General web applications, backend applications, and other processing applications work extremely well on scalable virtual platforms such as containers and VMs. However, big data, which amounts to terabytes and petabytes of data, requires a platform with a different kind of architecture to perform well. From the next section onwards, we will discuss platforms that are apt for big data processing.
With the advent of search engines, social networks, and online marketplaces, data volumes grew exponentially. Searching and processing such volumes needed a different approach to meet service-level agreements (SLAs) and customer expectations. Google, and later the open source Nutch project, used a new technology paradigm to solve this problem: storing and processing data in a distributed way automatically. As a result of this approach, Hadoop was born in 2008 and has proved to be a lifesaver for storing and processing huge volumes (in the order of terabytes or more) of data efficiently and quickly.
Apache Hadoop is an open source framework that enables distributed storage and processing of large datasets across a cluster of computers. It is designed to scale from a single server to thousands of machines easily. It provides high availability by having strong node failover and recovery features, which enables a Hadoop cluster to run on cheap commodity hardware.
In this section, we will discuss the architecture and various components of a Hadoop cluster. The following diagram provides a top-level overview of the Hadoop ecosystem:
Figure 3.22 – Hadoop ecosystem overview
As shown in the preceding diagram, the Hadoop ecosystem consists of three separate layers, as discussed here:
Figure 3.23 – HDFS architecture
HDFS has a master-slave architecture, where the NameNode is the master and the DataNodes are the slaves. The NameNode is responsible for storing metadata about all the files and directories in HDFS, including a mapping of which blocks are stored on which DataNode. There is also a secondary NameNode, which is responsible for housekeeping jobs on behalf of the NameNode, such as periodically merging the edit log into the filesystem image. DataNodes are the real horsepower of an HDFS system: they store the block-level data and perform all the necessary block-level operations on it. Each DataNode sends periodic signals called heartbeats to the NameNode to indicate that it is up and running, and it also periodically sends the NameNode a report of the blocks it holds.
When the client makes a read request, it gets the metadata information about the files and blocks from the NameNode. Then, it fetches the required blocks from the correct DataNode(s) using this metadata. When a client makes a write call, the data gets written into distributed blocks across various DataNodes. These blocks are then replicated across the nodes (on a different rack) for high availability in case there is an outage in the current rack.
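This read/write path is what happens under the hood whenever you interact with HDFS, for example through its shell. The commands below are a sketch; the file and directory names are placeholders:

```shell
# Copy a local file into HDFS; it is split into blocks and
# replicated across DataNodes automatically
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put sales.csv /data/raw/

# List the directory and read the file back
hdfs dfs -ls /data/raw
hdfs dfs -cat /data/raw/sales.csv

# Show how the file's blocks and their replicas are
# distributed across DataNodes
hdfs fsck /data/raw/sales.csv -files -blocks -locations
```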
Figure 3.24 – How resource manager works
As we can see, each client sends a request to the Resource Manager when it submits a processing job to Hadoop. The Resource Manager consists of a Scheduler and an Application Manager. The Application Manager accepts the job and negotiates the container for the application's ApplicationMaster by communicating with the Node Managers running on the worker nodes. Each ApplicationMaster is responsible for executing a single application; it requests the container resources its application needs from the Scheduler, which allocates them by interacting with the Node Managers. Two of the most popular resource managers for Hadoop are Apache YARN and Apache Mesos.
Although these three layers are interdependent, they are designed to be decoupled from each other. This decoupled layer architecture makes Hadoop flexible, powerful, and extendable, and it is why Hadoop processing has kept improving and evolving even as dataset sizes have grown at a tremendous rate and the SLAs for processing that data have tightened over time.
Although Hadoop is an open source framework, most production Hadoop clusters run one of the following distributions:
Apart from these distributions, which are meant for on-premise Hadoop deployments, some popular cloud distributions for Hadoop are available:
In this section, we briefly discussed Hadoop distributions and how they work. We also covered the various Hadoop distributions that are available from various vendors for running Hadoop in a production environment.
As data keeps growing, on-premise infrastructure needs to grow with it, and its capacity must be planned for the maximum expected load. This leads to underutilization of resources most of the time, or overutilization when an unexpected load occurs. The answer to this problem is cloud computing. In the next section, we will discuss various cloud platforms and the benefits they bring to data engineering solutions.
Cloud computing involves delivering computing services such as storage, compute, networking, and intelligence over the internet. It offers a pay-as-you-go model, which means you only pay for the service you use. This helps cut down on your operating costs, as well as capital expenditure (CapEx) costs. Cloud enables optimal resource utilization, instant scalability, agility, and ease of maintenance, enabling faster innovation and economies of scale. For example, Canva is a design tool that anyone can access via its simple user interface. In 2019, it had 55 million users. At the time of writing, it has 85 million users worldwide, creating 100+ designs per second. To accommodate this exponential customer and data volume growth seamlessly with similar or better performance, Canva uses the AWS platform.
The following cloud computing distributions are the market leaders in cloud computing and are often referred to as the Big 3 of cloud computing:
Apart from the Big 3, there are other smaller or lesser-known cloud distributions such as Red Hat OpenShift, HPE GreenLake, and IBM Cloud.
The following are the benefits of cloud computing:
There are three types of cloud computing, as follows:
Now that we have discussed the different types of cloud computing, let’s try to understand the various types of cloud services available in a public cloud distribution. The various types of cloud services are as follows:
In the cloud, the responsibility of owning various stacks in application development is shared between cloud vendors and the customers. The following diagram shows the shared responsibility model for these kinds of cloud computing services:
Figure 3.25 – Shared responsibility model
As we can see, if the customer is running a private cloud, all the resources, services, applications, and data are the customer's responsibility. However, if you opt for a public cloud, you can choose between IaaS, PaaS, and SaaS. In an IaaS model, the cloud vendor manages and owns infrastructure services such as compute, storage, and networking. If you go for a PaaS model, then, in addition to what you get with IaaS, the cloud provider also manages the OS, VMs, and runtime, so you only develop and manage your applications, data, and access. In SaaS, everything except your data and access is managed by the cloud vendor; even the application or software itself is managed by the provider. Although SaaS might be costlier per unit than the other two models, depending on your business it can be cheaper and more hassle-free overall.
With that, we have discussed the various platforms where data engineering applications may be deployed. Now, let’s discuss the various design choices that an architect needs to know to choose the correct platform for them.
In this section, we will look at one of the most important decisions architects have to make – how to choose the most suitable platform for a use case. Here, we will understand when to choose between virtualization versus containerization and on-premise versus the cloud when considering various cloud data platforms.
Although both technologies help us use physical resources to the best of our ability by provisioning virtual ones, each has its advantages depending on the type of application.
Microservices architecture is a variant of service-oriented architecture in which an application is perceived as a collection of loosely coupled services, each fine-grained and lightweight. Microservices are best suited to container-based platforms; for example, a REST service can easily be deployed using containers. Since microservices are loosely coupled, they should be easy to deploy and scale, and since each service can be independently consumed and reused by other services and stacks, they need to be portable enough to migrate quickly to any containerized platform.
Monolithic applications, on the other hand, are designed to perform multiple related tasks but are built as a single, tightly coupled application. Such applications are better suited to small teams or Proof of Concept (POC) purposes; legacy applications are another common case of monolithic architecture. Monolithic applications are best suited to virtualization. Virtualized platforms are also preferred over containerization for any application that depends on, or talks directly to, a specific OS.
However, in the cloud, all the servers that are provisioned are VMs. Containerized platforms such as Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS) run on top of virtual servers such as Amazon Elastic Compute Cloud (EC2). So, in modern architectures, especially in the cloud, the question is not containerization versus virtualization; it is containerization on top of virtualization versus virtualization alone.
If we are handling data that is terabytes or petabytes in size, a big data platform is a good choice. As artificial intelligence (AI) and machine learning (ML) applications grow in popularity, we need to deal with huge volumes of data; generally, the larger the dataset, the more accurate the AI models. These volumes run into the terabytes, and big data applications can process such data in a scalable fashion. There are also use cases where, due to processing complexity, even hundreds of gigabytes of data take an unnecessarily long time to process; in such scenarios, big data may be a good solution. Most big data use cases involve analytics, Online Analytical Processing (OLAP), AI, and ML.
This is an obvious question that architects face today. In this section, we will try to see what factors affect this decision, as well as recommend a few general criteria to help you decide on one over the other. The factors that will help you decide between on-premise versus cloud solutions are as follows:
Based on these factors, here are some broad guidelines that you can use to make this decision. However, note that these are only recommendations – the actual choice will depend on your specific business needs and context.
You should choose on-premise architectures in the following circumstances:
You should choose cloud architectures in the following circumstances:
Finally, let’s compare the Big 3 cloud vendors to decide which provider is a best fit for your business.
In this section, we will compare the Big 3 public cloud vendors and how they perform in various categories, even though there is no clear answer to the question, Which cloud vendor is best for my business? The following table compares the Big 3 cloud providers and highlights their strengths and weaknesses:
|                      | AWS                                        | Azure                                                         | GCP                                                             |
| Services             | Huge range of services                     | Good range of services; exceptional services in AI/ML         | Limited services available                                      |
| Maturity             | Most mature                                | Catching up with AWS                                          | Still relatively less mature than the other two                 |
| Marketplace          | All vendors make their products available  | Good vendor support, but less than AWS                        |                                                                 |
| Reliability          | Excellent                                  | Excellent                                                     | Excellent                                                       |
| Security             | Excellent                                  | Excellent                                                     | A few notches below AWS and Azure                               |
| Cost                 | Varies                                     | Most cost-efficient                                           | Varies                                                          |
| Support              | Paid dev/enterprise support                | Paid dev/enterprise support; more support options than AWS    | Paid dev/premium support; costlier support than the other two   |
| Hybrid Cloud Support | Limited                                    | Excellent                                                     | Good                                                            |
| Special Notes        | More compute capacity versus Azure and GCP | Easy integration and migration for existing Microsoft services | Excellent support for containerized workloads; global fiber network |
Figure 3.26 – Comparison of the Big 3 cloud vendors
In short, AWS is the market leader, but both Azure and GCP are catching up. If you are looking for the maximum number of services available across the globe, AWS is the obvious choice, but it comes with a steeper learning curve.
If your use case revolves around AI/ML, or you already have a Microsoft-based on-premise infrastructure, Azure may be the right choice. It has excellent enterprise support, and if you need a robust hybrid cloud infrastructure, Microsoft Azure is your go-to option.
GCP entered the race late, but they have excellent integration and support for open source and third-party services.
But in the end, it boils down to your specific use case. As the market grows, most enterprises are adopting multi-cloud strategies to leverage the best of each vendor.
Now, let’s summarize this chapter.
In this chapter, we discussed various virtualization platforms. First, we briefly covered the architectures of the virtualization, containerization, and container orchestration frameworks. Then, we deployed VMs, Docker containers, and Kubernetes containers and ran an application on top of them. In doing so, we learned how to configure Dockerfiles and Kubernetes deployment scripts. After that, we discussed the Hadoop architecture and the various Hadoop distributions that are available on the market. Then, we briefly discussed cloud computing and its basic concepts. Finally, we covered the decisions that every data architect has to make: containers or VMs? Do I need big data processing? Cloud or on-premise? If the cloud, which cloud?
With that, we have a good understanding of some of the basic concepts and nuances of data architecting, including the basic concepts, databases, data storage, and the various platforms these solutions run on in production. In the next chapter, we will dive deeper into how to architect various data processing and data ingestion pipelines.