In the previous chapter, we discussed the various data types, their formats, and their storage. We also covered different databases and provided an overview of them. Finally, we examined the factors and features to compare when choosing a data format, storage type, or database, so that you can solve a data engineering problem effectively.
In this chapter, we will look at the various kinds of popular platforms that are available to run data engineering solutions. You will also learn about the considerations you should make as an architect to choose one of them. To do so, we will discuss the finer details of each platform and the alternatives these platforms provide. Finally, you will learn how to make the most of these platforms to architect an efficient, robust, and cost-effective solution for a business problem.
In this chapter, we’re going to cover the following main topics:
To complete this chapter, you’ll need the following:
The code for this chapter can be found in this book’s GitHub repository: https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/tree/main/Chapter03
With the spread of information technology (IT) into all spheres of life, dependence on IT infrastructure has increased manifold. IT now runs many critical, real-time businesses, which means there can be zero or negligible downtime for maintenance or failure. Demands can also spike rapidly; for example, online shopping websites see a huge surge in traffic during the holiday season. So, IT needs to be highly available, elastic, flexible, and quick. These requirements motivated the creation of virtual platform technologies such as virtualization and containerization. For example, Barclays, a multinational financial firm based in the UK, was losing ground to competitors due to its slow pace of innovation and project delivery. One of its major roadblocks was the time it took to provision new servers, so it decided to use Red Hat OpenShift to containerize its applications. This reduced provisioning time dramatically, from weeks to hours. As a result, time to market became much faster, which helped Barclays stay ahead of its competitors.
Virtualization abstracts the hardware and allows you to run multiple operating systems on a single server or piece of hardware. It uses software to create a virtual abstraction over the hardware resources so that multiple virtual machines (VMs) can run over the physical hardware with their virtual OS, virtual CPU (vCPU), virtual storage, and virtual networking. The following diagram shows how virtualization works:
Figure 3.1 – Virtualization
As shown in the preceding diagram, VMs run on a host machine with the help of a hypervisor. A hypervisor is a piece of software or firmware that can host VMs on physical hardware such as a server or a computer. The physical machines on which hypervisors create VMs are called host machines, and the VMs are called guest machines. The operating system on the host machine is called the host OS, while the operating system in a VM is called the guest OS.
The following are the benefits of virtualization:
The following are a few examples of popular VMs:
Let’s see how a VM works. We will start this exercise by downloading and installing Oracle VirtualBox:
Then, install VirtualBox using the installation instructions at https://www.virtualbox.org/manual/ch02.html. These instructions are likely to vary by OS.
Figure 3.2 – The Oracle VirtualBox Manager home page
Figure 3.3 – Configuring the guest VM using Oracle VirtualBox
In the preceding screenshot, you can see that you need to provide a unique name for the VM. You can also select the OS type and its version, as well as configure the memory (RAM) size. Finally, you can choose to configure or add a new virtual hard disk. If you choose to add a new hard disk, then a popup similar to the following will appear:
Figure 3.4 – Creating a virtual hard disk using Oracle VirtualBox
As shown in the preceding screenshot, when configuring a virtual hard disk, you can choose from various kinds of available virtual hard disk drives. The major popular virtual hard disks are as follows:
Once you have configured your desired virtual hard disk configuration, you can create the VM by clicking the Create button.
Figure 3.5 – Guest VM created and listed in Oracle VirtualBox
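The same VM setup can also be scripted with VirtualBox's VBoxManage CLI instead of the GUI, which is handy for automation. The following is a minimal sketch; the VM name, OS type, memory, and disk sizes are illustrative placeholders, not values from this exercise:

```shell
# Create and register a new guest VM (name and OS type are placeholders)
VBoxManage createvm --name "ubuntu-guest" --ostype "Ubuntu_64" --register

# Configure memory (in MB) and the number of vCPUs
VBoxManage modifyvm "ubuntu-guest" --memory 2048 --cpus 2

# Create a 20 GB VDI virtual hard disk and attach it via a SATA controller
VBoxManage createmedium disk --filename ubuntu-guest.vdi --size 20480 --format VDI
VBoxManage storagectl "ubuntu-guest" --name "SATA" --add sata
VBoxManage storageattach "ubuntu-guest" --storagectl "SATA" \
  --port 0 --device 0 --type hdd --medium ubuntu-guest.vdi
```

Running these commands produces the same kind of guest VM you see listed in the VirtualBox Manager.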
Although VMs simplify delivery and provide a platform that is more available and faster to provision than traditional servers, they have some limitations:
Containerization can help us overcome these shortcomings. We’ll take a look at containerization in the next section.
Containerization is a technique that abstracts the OS (instead of the hardware) and lets applications run directly on top of it. Containerization is more efficient than virtualization because applications don't need a guest OS to run. Instead, all containers share the kernel of the host OS, which lets multiple isolated applications run side by side on the same machine. The following diagram shows how containerization works:
Figure 3.6 – Containerization
In containerization, a piece of software called a container engine runs on the host OS. This allows applications to run on top of the container engine, without any need to create a separate guest OS. Each running instance of the application, along with its dependencies, is called a container. Here, the application, along with its dependencies, can be bundled into a portable package called an image.
The following are the advantages of containerization over virtualization:
Docker is the most popular container engine. Let’s look at some of the most important and common terms related to Docker:
Now that we have learned the important terminology related to Docker, let's learn how to set up Docker on a local machine using Docker Desktop. We will also show you how to deploy an image and start a Docker container:
Based on your operating system, you can follow the installation instructions at https://docs.docker.com/desktop/mac/install/ (for Mac) or https://docs.docker.com/desktop/windows/install/ (for Windows).
If you don’t have Maven installed in your system, please download and install it (instructions for installing Maven can be found at https://maven.apache.org/install.html).
Figure 3.7 – Docker Desktop home page
Once you have successfully created your account, click the Sign In button and enter your Docker ID and password to log in, as shown in the following screenshot:
Figure 3.8 – Logging into Docker Desktop
Figure 3.9 – Docker build commands
To try this out, first, download the code from https://github.com/PacktPublishing/Scalable-Data-Architecture-with-Java/blob/main/Chapter03/sourcecode/DockerExample. In this project, we will create a simple REST API using Spring Boot, then deploy and run the application locally in our Docker environment. The artifact that is generated when we build this project is DockerExample-1.0-SNAPSHOT.jar. The Dockerfile looks as follows:
# Each step creates a read-only layer of the image.
# For Java 8
FROM openjdk:8-jdk-alpine
# cd /opt/app
WORKDIR /opt/app
# cp target/DockerExample-1.0-SNAPSHOT.jar /opt/app/app.jar
COPY target/DockerExample-1.0-SNAPSHOT.jar app.jar
# exposing the port on which application runs
EXPOSE 8080
# java -jar /opt/app/app.jar
ENTRYPOINT ["java","-jar","/opt/app/app.jar"]
In step 1 of the Dockerfile, we pull a base image from Docker Hub. In step 2, we set the working directory inside the Docker container to /opt/app. In the next step, we copy our artifact into that working directory. After that, we expose port 8080 from the container. Finally, we execute the Java app using the java -jar command.
> mvn clean install
> docker build --tag=hello-docker:latest .
Once you run this command, you will be able to see the Docker image in Docker Desktop, in the Images tab, as shown in the following screenshot:
Figure 3.10 – Docker container created successfully and listed
Figure 3.11 – Running a Docker container
Provide a container name and a host port value and click Run in the popup dialog to start the container, as shown in the following screenshot:
Figure 3.12 – Setting up Docker Run configurations
Once you click Run, the container will be instantiated and you will be able to see the container listed (with its status set to RUNNING) in the Containers / Apps tab of Docker Desktop, as shown in the following screenshot:
Figure 3.13 – Running instance on Docker Desktop
Figure 3.14 – Testing the app deployed on Docker
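The same run-and-test cycle can be performed from a terminal, without the Docker Desktop GUI. The following is a sketch; the /hello endpoint path is an assumption, so use whatever path your Spring Boot controller actually maps:

```shell
# Run the image built earlier, mapping host port 8080 to the
# container's exposed port 8080
docker run -d --name hello-docker -p 8080:8080 hello-docker:latest

# Call the REST endpoint (the /hello path is a placeholder for
# your controller's mapping)
curl http://localhost:8080/hello

# Stop and remove the container when you are done
docker stop hello-docker
docker rm hello-docker
```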
You can log into the Docker CLI using the CLI button in Docker Desktop, as follows:
Figure 3.15 – Opening the Docker CLI from Docker Desktop
In this section, we learned about Docker. While Docker makes our lives easy and makes development and deployment considerably faster, it comes with the following set of challenges:
In a production environment, we need to solve these shortcomings if we want to run a robust, efficient, scalable, and cost-effective solution. Here, container orchestrators come to the rescue. There are many container orchestrators on the market. However, Kubernetes, which was developed and open-sourced by Google, is one of the most popular and widely used container orchestrators. In the next section, we will discuss Kubernetes in more detail.
Kubernetes is an open source container orchestrator that effectively manages containerized applications and their inter-container communication. It also automates how containers are deployed and scaled. Each Kubernetes cluster has multiple components:
The following diagram shows the various components of a Kubernetes cluster:
Figure 3.16 – A Kubernetes cluster and its components
Now, let’s briefly describe each of the components shown in the preceding diagram:
Now that we have briefly seen the components of Kubernetes and its role in containerization, let’s try to deploy the Docker image we created in the previous section in a Kubernetes cluster locally. To do that, we must install minikube (a Kubernetes cluster for running Kubernetes on your local machine):
> minikube start
A successful minikube start looks as follows:
Figure 3.17 – Starting minikube
apiVersion: v1
kind: Service
metadata:
  name: hello-docker-service
spec:
  selector:
    app: hello-docker-app
  ports:
    - protocol: "TCP"
      port: 8080
      targetPort: 8080
      nodePort: 30036
  type: LoadBalancer
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-docker-app
spec:
  selector:
    matchLabels:
      app: hello-docker-app
  replicas: 5
  template:
    metadata:
      labels:
        app: hello-docker-app
    spec:
      containers:
        - name: hello-docker-app
          image: hello-docker-app
          imagePullPolicy: Never
          ports:
            - containerPort: 8080
The deployment.yaml file contains two types of configuration: one for the Service and one for the Deployment. Each Kubernetes component configuration mainly consists of three parts:
> eval $(minikube docker-env)
> docker build --tag hello-docker-app:latest .
> kubectl apply -f deployment.yaml
After executing this command, you should be able to see that hello-docker-service and hello-docker-app were created successfully, as shown in the following screenshot:
Figure 3.18 – Applications created in Kubernetes cluster
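You can also verify and manage the deployment from the command line. The following kubectl commands assume the deployment.yaml file shown earlier has been applied:

```shell
# List the pods created by the Deployment (there should be 5 replicas)
kubectl get pods -l app=hello-docker-app

# Inspect the Service and the Deployment
kubectl get service hello-docker-service
kubectl get deployment hello-docker-app

# Scale the Deployment up or down without editing the YAML
kubectl scale deployment hello-docker-app --replicas=3
```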
> minikube dashboard
Once executed, you should see the dashboard appear in your default browser. Here, you will be able to see the status of your deployment, as well as other monitoring information:
Figure 3.19 – The minikube dashboard
> minikube service hello-docker-service
Once you have started the service, you can check the base URL that’s been exposed by the Docker service linked to your application by executing the following command:
> minikube service --url hello-docker-service
This command will show an output similar to the following:
Figure 3.20 – Checking the base URL
Figure 3.21 – Testing app deployed using Kubernetes
In this section, we discussed how virtualization and containerization can help you manage, deploy, and develop an application in more effective, faster, and cost-optimized ways. General web applications, backend applications, and other processing applications work extremely well on scalable virtual platforms such as containers and VMs. However, big data, which amounts to terabytes and petabytes of data, requires a platform with a different kind of architecture to perform well. From the next section onwards, we will discuss platforms that are apt for big data processing.
With the advent of search engines, social networks, and online marketplaces, data volumes grew exponentially. Searching and processing such volumes needed a different approach to meet service-level agreements (SLAs) and customer expectations. Google, and later the open source Nutch project, used a new technology paradigm to solve this problem: storing and processing data in a distributed way automatically. As a result of this approach, Hadoop was born in 2008 and has proved to be a lifesaver for storing and processing huge volumes (in the order of terabytes or more) of data efficiently and quickly.
Apache Hadoop is an open source framework that enables distributed storage and processing of large datasets across a cluster of computers. It is designed to scale from a single server to thousands of machines easily. It provides high availability by having strong node failover and recovery features, which enables a Hadoop cluster to run on cheap commodity hardware.
In this section, we will discuss the architecture and various components of a Hadoop cluster. The following diagram provides a top-level overview of the Hadoop ecosystem:
Figure 3.22 – Hadoop ecosystem overview
As shown in the preceding diagram, the Hadoop ecosystem consists of three separate layers, as discussed here:
Figure 3.23 – HDFS architecture
HDFS has a master-slave architecture, where the NameNode is the master and the DataNodes are the slaves. The NameNode is responsible for storing metadata about all the files and directories in HDFS, including a mapping of which blocks are stored on which DataNode. There is also a secondary NameNode, which is responsible for housekeeping jobs on behalf of the NameNode, such as periodically merging the edit log into the filesystem image. DataNodes are the real horsepower of an HDFS system: they store the block-level data and perform all the necessary block-level operations on it. Each DataNode sends periodic signals called heartbeats to the NameNode to indicate that it is up and running, and it also periodically sends the NameNode a report of the blocks it holds.
When the client makes a read request, it gets the metadata information about the files and blocks from the NameNode. Then, it fetches the required blocks from the correct DataNode(s) using this metadata. When a client makes a write call, the data gets written into distributed blocks across various DataNodes. These blocks are then replicated across the nodes (on a different rack) for high availability in case there is an outage in the current rack.
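This read/write path is what happens under the hood whenever you interact with HDFS, for example through its shell. The commands below are a sketch; the file and directory names are placeholders:

```shell
# Copy a local file into HDFS; it is split into blocks and
# replicated across DataNodes automatically
hdfs dfs -mkdir -p /data/raw
hdfs dfs -put sales.csv /data/raw/

# List the directory and read the file back
hdfs dfs -ls /data/raw
hdfs dfs -cat /data/raw/sales.csv

# Show how the file's blocks and their replicas are
# distributed across DataNodes
hdfs fsck /data/raw/sales.csv -files -blocks -locations
```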
Figure 3.24 – How resource manager works
As we can see, each client sends a request to the Resource Manager when it submits a processing job to Hadoop. The Resource Manager consists of a Scheduler and an Application Manager. The Application Manager accepts the job and negotiates the container for the application's ApplicationMaster by communicating with the Node Managers running on the worker nodes. Each ApplicationMaster is responsible for executing a single application; it requests the container resources its application needs from the Scheduler, which allocates them by interacting with the Node Managers. Two of the most popular resource managers for Hadoop are Apache YARN and Apache Mesos.
Although these three layers are interdependent, they are designed to be decoupled from each other. This decoupled layer architecture makes Hadoop flexible, powerful, and extendable, and it is why Hadoop processing has kept improving and evolving even as dataset sizes have grown at a tremendous rate and the SLAs for processing that data have tightened over time.
Although Hadoop is an open source framework, most production Hadoop clusters run one of the following distributions:
Apart from these distributions, which are meant for on-premise Hadoop deployments, some popular cloud distributions for Hadoop are available:
In this section, we briefly discussed Hadoop distributions and how they work. We also covered the various Hadoop distributions that are available from various vendors for running Hadoop in a production environment.
As data keeps growing, on-premise infrastructure needs to grow with it, and its capacity must be planned for the maximum expected load. This leads to underutilization of resources most of the time, or overutilization when an unexpected load occurs. The answer to this problem is cloud computing. In the next section, we will discuss various cloud platforms and the benefits they bring to data engineering solutions.
Cloud computing involves delivering computing services such as storage, compute, networking, and intelligence over the internet. It offers a pay-as-you-go model, which means you only pay for the service you use. This helps cut down on your operating costs, as well as capital expenditure (CapEx) costs. Cloud enables optimal resource utilization, instant scalability, agility, and ease of maintenance, enabling faster innovation and economies of scale. For example, Canva is a design tool that anyone can access via its simple user interface. In 2019, it had 55 million users. At the time of writing, it has 85 million users worldwide, creating 100+ designs per second. To accommodate this exponential customer and data volume growth seamlessly with similar or better performance, Canva uses the AWS platform.
The following cloud computing distributions are the market leaders in cloud computing and are often referred to as the Big 3 of cloud computing:
Apart from the Big 3, there are other smaller or lesser-known cloud distributions such as Red Hat OpenShift, HPE GreenLake, and IBM Cloud.
The following are the benefits of cloud computing:
There are three types of cloud computing, as follows:
Now that we have discussed the different types of cloud computing, let’s try to understand the various types of cloud services available in a public cloud distribution. The various types of cloud services are as follows:
In the cloud, the responsibility of owning various stacks in application development is shared between cloud vendors and the customers. The following diagram shows the shared responsibility model for these kinds of cloud computing services:
Figure 3.25 – Shared responsibility model
As we can see, if the customer is running a private cloud, all the resources, services, applications, and data are the customer's responsibility. However, if you opt for a public cloud, you can choose between IaaS, PaaS, and SaaS. In an IaaS model, the cloud vendor manages and owns infrastructure services such as compute, storage, and networking. If you go for a PaaS model, then, in addition to what you get with IaaS, the cloud provider also manages the OS, VMs, and runtime, so you only develop and manage your applications, data, and access. In SaaS, everything except your data and access is managed by the cloud vendor; even the application or software itself is managed by the provider. Although SaaS might be costlier per unit than the other two models, depending on your business it can be cheaper and more hassle-free overall.
With that, we have discussed the various platforms where data engineering applications may be deployed. Now, let’s discuss the various design choices that an architect needs to know to choose the correct platform for them.
In this section, we will look at one of the most important decisions architects have to make – how to choose the most suitable platform for a use case. Here, we will understand when to choose between virtualization versus containerization and on-premise versus the cloud when considering various cloud data platforms.
Although both technologies help us use physical resources to the best of our ability by provisioning virtual ones, each has its advantages depending on the type of application.
Microservices architecture is a variant of service-oriented architecture in which an application is perceived as a collection of loosely coupled services, each fine-grained and lightweight. Microservices are best suited to container-based platforms; for example, a REST service can easily be deployed using containers. Since microservices are loosely coupled, they should be easy to deploy and scale, and since each service can be independently consumed and reused by other services and stacks, they need to be portable enough to migrate quickly to any containerized platform.
Monolithic applications, on the other hand, are designed to perform multiple related tasks but are built as a single, tightly coupled application. Such applications are better suited to small teams or Proof of Concept (POC) purposes; legacy applications are another common case of monolithic architecture. Monolithic applications are best suited to virtualization. Virtualized platforms are also preferred over containerization for any application that depends on, or talks directly to, a specific OS.
However, in the cloud, all the servers that are provisioned are VMs. Containerized platforms such as Amazon Elastic Container Service (ECS) and Amazon Elastic Kubernetes Service (EKS) run on top of virtual servers such as Amazon Elastic Compute Cloud (EC2). So, in modern architectures, especially in the cloud, the question is not containerization versus virtualization; it is containerization on top of virtualization versus virtualization alone.
If we are handling data that is terabytes or petabytes in size, a big data platform is a good choice. As artificial intelligence (AI) and machine learning (ML) applications grow in popularity, we need to deal with huge volumes of data; generally, the larger the dataset, the more accurate the AI models. These volumes run into the terabytes, and big data applications can process such data in a scalable fashion. There are also use cases where, due to processing complexity, even hundreds of gigabytes of data take an unnecessarily long time to process; in such scenarios, big data may be a good solution. Most big data use cases involve analytics, Online Analytical Processing (OLAP), AI, and ML.
This is an obvious question that architects face today. In this section, we will try to see what factors affect this decision, as well as recommend a few general criteria to help you decide on one over the other. The factors that will help you decide between on-premise versus cloud solutions are as follows:
Based on these factors, here are some broad guidelines that you can use to make this decision. However, note that these are only recommendations – the actual choice will depend on your specific business needs and context.
You should choose on-premise architectures in the following circumstances:
You should choose cloud architectures in the following circumstances:
Finally, let’s compare the Big 3 cloud vendors to decide which provider is a best fit for your business.
In this section, we will compare the Big 3 public cloud vendors and how they perform in various categories, even though there is no clear answer to the question, Which cloud vendor is best for my business? The following table compares the Big 3 cloud providers and highlights their strengths and weaknesses:
|                      | AWS                                        | Azure                                                         | GCP                                                             |
| Services             | Huge range of services                     | Good range of services; exceptional services in AI/ML         | Limited services available                                      |
| Maturity             | Most mature                                | Catching up with AWS                                          | Still relatively less mature than the other two                 |
| Marketplace          | All vendors make their products available  | Good vendor support, but less than AWS                        |                                                                 |
| Reliability          | Excellent                                  | Excellent                                                     | Excellent                                                       |
| Security             | Excellent                                  | Excellent                                                     | A few notches below AWS and Azure                               |
| Cost                 | Varies                                     | Most cost-efficient                                           | Varies                                                          |
| Support              | Paid dev/enterprise support                | Paid dev/enterprise support; more support options than AWS    | Paid dev/premium support; costlier support than the other two   |
| Hybrid Cloud Support | Limited                                    | Excellent                                                     | Good                                                            |
| Special Notes        | More compute capacity versus Azure and GCP | Easy integration and migration for existing Microsoft services | Excellent support for containerized workloads; global fiber network |
Figure 3.26 – Comparison of the Big 3 cloud vendors
In short, AWS is the market leader, but both Azure and GCP are catching up. If you are looking for the maximum number of services available across the globe, AWS is the obvious choice, but it comes with a steeper learning curve.
If your use case revolves around AI/ML, or you already have a Microsoft-based on-premise infrastructure, Azure may be the right choice. It has excellent enterprise support, and if you need a robust hybrid cloud infrastructure, Microsoft Azure is your go-to option.
GCP entered the race late, but they have excellent integration and support for open source and third-party services.
But in the end, it boils down to your specific use case. As the market grows, most enterprises are adopting multi-cloud strategies to leverage the best of each vendor.
Now, let’s summarize this chapter.
In this chapter, we discussed various virtualization platforms. First, we briefly covered the architectures of the virtualization, containerization, and container orchestration frameworks. Then, we deployed VMs, Docker containers, and Kubernetes containers and ran an application on top of them. In doing so, we learned how to configure Dockerfiles and Kubernetes deployment scripts. After that, we discussed the Hadoop architecture and the various Hadoop distributions that are available on the market. Then, we briefly discussed cloud computing and its basic concepts. Finally, we covered the decisions that every data architect has to make: containers or VMs? Do I need big data processing? Cloud or on-premise? If the cloud, which cloud?
With that, we have a good understanding of some of the basic concepts and nuances of data architecting, including the basic concepts, databases, data storage, and the various platforms these solutions run on in production. In the next chapter, we will dive deeper into how to architect various data processing and data ingestion pipelines.