Since the turn of the 21st century, ever-growing internet usage and the emergence of more powerful data insight tools and technologies have caused a data explosion, and data has become the new gold. This has driven an increased demand for useful and actionable data, as well as for quality data engineering solutions. However, architecting and building scalable, reliable, and secure data engineering solutions is often complicated and challenging.
A poorly architected solution often fails to meet the needs of the business: the data quality is poor, SLAs are missed, or the solution isn't sustainable or scalable as the data grows in production. To help data engineers and architects build better solutions, dozens of open source and proprietary tools are released every year. Yet even a well-designed solution sometimes fails because of a poor choice or implementation of those tools.
This book discusses various architectural patterns, tools, and technologies with step-by-step hands-on explanations to help an architect choose the most suitable solution and technology stack to solve a data engineering problem. Specifically, it focuses on tips and tricks to make architectural decisions easier. It also covers other essential skills that a data architect requires, such as data governance, data security, performance engineering, and effectively presenting an architecture to customers or upper management.
In this chapter, we will explore the landscape of data engineering and the basic features of data in modern business ecosystems. We will cover various categories of modern data engineering problems that a data architect tries to solve. Then, we will learn about the roles and responsibilities of a Java data architect. We will also discuss the challenges that a data architect faces while designing a data engineering solution. Finally, we will provide an overview of the techniques and tools that we’ll discuss in this book and how they will help an aspiring data architect do their job more efficiently and be more productive.
In this chapter, we’re going to cover the following main topics:
In this section, you will learn what data engineering is and why it is needed. You will also learn about the various categories of data engineering problems and some real-world scenarios where they are found. It is important to understand the varied nature of data engineering problems before you learn how to architect solutions for such real-world problems.
By definition, data engineering is the branch of software engineering that specializes in collecting, analyzing, transforming, and storing data in a usable and actionable form.
With the growth of social platforms, search engines, and online marketplaces, the rate of data generation has increased exponentially. In 2020 alone, around 2,500 petabytes of data were generated by humans each day, and this figure is estimated to reach 468 exabytes per day by 2025. The high volume and availability of data have enabled rapid technological development in AI and data analytics, allowing businesses, corporations, and governments to gather insights like never before and give customers a better experience of their services.
However, raw data by itself is seldom usable. As a result, there is an increased demand for data that is usable, secure, and reliable. Data engineering revolves around creating scalable solutions to collect raw data and then analyze, validate, transform, and store it in a usable and actionable format. In certain scenarios and organizations, businesses additionally expect this usable and actionable data to be published as a service.
Before we dive deeper, let’s explore a few practical use cases of data engineering:
Now that we understand data engineering, let’s look at a few of its basic concepts. We will start by looking at the dimensions of data.
Any discussion on data engineering is incomplete without talking about the dimensions of data. The dimensions of data are some basic characteristics by which the nature of data can be analyzed. The starting point of data engineering is analyzing and understanding the data.
To successfully analyze and build a data-oriented solution, the four Vs of modern data analysis are very important. These can be seen in the following diagram:
Figure 1.1 – Dimensions of data
Let’s take a look at each of these Vs in detail:
Now that we have a fair idea of the characteristics by which the nature of data can be analyzed, let’s understand how they play a vital role in different types of data engineering problems.
Broadly speaking, the kinds of problems that data engineers solve can be classified into two basic types:
Let’s take a look at these problems in more detail.
The problems that are related to collecting raw data or events, processing them, and storing them in a usable or actionable data format are broadly categorized as processing problems. Typical use cases can be a data ingestion problem such as Extract, Transform, Load (ETL) or a data analytics problem such as generating a year-on-year report.
Again, processing problems can be divided into three major categories, as follows:
This can be seen in the following diagram:
Figure 1.2 – Categories of processing problems
Let’s take a look at each one of these categories in detail.
If the SLA of processing is more than 1 hour (for example, if the processing needs to be done once every 2 hours, once daily, once weekly, or once biweekly), then such a problem is called a batch processing problem. This is because, when a system processes data at a longer time interval, it usually processes a batch of data records rather than a single record/event. Hence, such processing is called batch processing:
Figure 1.3 – Batch processing problem
Usually, a batch processing solution depends on the volume of data: if the volume runs into tens of terabytes or more, it usually needs to be processed as big data. Also, since batch processes are schedule-driven, a workflow manager or scheduler is needed to run the jobs. We will discuss batch processing in more detail later in this book.
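To make this concrete, here is a minimal sketch of a daily batch job using Apache Spark's Java API. The input/output paths, column names, and job name are hypothetical; a scheduler such as Apache Airflow or cron would trigger this job once per day:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// A minimal Spark batch job: reads one day's worth of raw records,
// aggregates them, and writes the result to a curated zone.
public class DailySalesBatchJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("daily-sales-batch")
                .getOrCreate();

        // Hypothetical raw-zone input path, partitioned by date
        Dataset<Row> rawSales = spark.read()
                .option("header", "true")
                .option("inferSchema", "true")
                .csv("s3a://raw-zone/sales/dt=2023-01-01/");

        // Aggregate total sales per product for the whole batch window
        Dataset<Row> dailyTotals = rawSales
                .groupBy("product_id")
                .sum("amount");

        // Persist the processed batch to the curated zone
        dailyTotals.write()
                .mode("overwrite")
                .parquet("s3a://curated-zone/sales-daily/dt=2023-01-01/");

        spark.stop();
    }
}
```

Note that the job processes the entire day's batch in one run and then exits; the scheduler, not the job itself, decides when the next batch runs.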
A real-time processing problem is a use case where raw data/events are to be processed on the fly and the response or the processing outcome should be available within seconds, or at most within 2 to 5 minutes.
As shown in the following diagram, a real-time process receives data in the form of an event stream and immediately processes it. Then, it either sends the processed event to a sink or to another stream of events to be processed further. Since this kind of processing happens on a stream of events, this is known as real-time stream processing:
Figure 1.4 – Real-time stream processing
As shown in Figure 1.4, event E0 gets processed and sent out by the streaming application, while events E1, E2, and E3 wait in the queue to be processed. At t1, event E1 also gets processed, illustrating the continuous processing of events by the streaming application.
An event can be generated at any time (24/7), which creates a new kind of problem. If the producer application sends an event directly to a consumer, there is a chance of event loss unless the consumer application is running 24/7. This means the consumer application cannot even be brought down for maintenance or upgrades; it must have zero downtime. However, zero downtime is not realistic for any application. Such a model of communication between applications is called point-to-point communication.
Another challenge in point-to-point communication for real-time problems is the speed of processing: the consumer must always process events at a rate equal to or greater than that of the producer. Otherwise, events will be lost or the consumer may run out of memory. So, instead of sending events directly to the consumer application, they are sent asynchronously to an Event Bus or a Message Bus. An Event Bus is a high-availability container, such as a queue or a topic, that can hold events. This pattern of sending and receiving data asynchronously by introducing a high-availability Event Bus in between is called the Pub-Sub framework.
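Here is a minimal sketch of the Pub-Sub pattern using the Apache Kafka Java client as the Event Bus. The topic name, consumer group, and event payload are hypothetical:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class PubSubSketch {
    public static void main(String[] args) {
        // Producer: publishes an event to the "transactions" topic
        // asynchronously, without knowing who will consume it.
        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            producer.send(new ProducerRecord<>("transactions", "txn-1001", "{\"amount\": 42.5}"));
        }

        // Consumer: reads from the same topic at its own pace. The broker
        // buffers events, so the consumer can be restarted for maintenance
        // without events being lost.
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "fraud-detector");
        consumerProps.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps)) {
            consumer.subscribe(List.of("transactions"));
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
            for (ConsumerRecord<String, String> record : records) {
                System.out.printf("Processing event %s -> %s%n", record.key(), record.value());
            }
        }
    }
}
```

Because the broker sits between the two applications, the producer and consumer are fully decoupled: neither needs to know the other's address, availability, or processing speed.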
The following are some important terms related to real-time processing problems:
A real-world example of a real-time problem is credit card fraud detection: if a transaction seems suspicious while being executed, you may receive an automated confirmation call from your bank to verify its authenticity.
Near-real-time processing, as its name suggests, is a problem whose response or processing time doesn't need to be as fast as real time but should be less than 1 hour. One of the features of near-real-time processing is that it processes events in micro-batches. For example, a near-real-time process may process data at a batch interval of 5 minutes, a batch size of 100 records, or a combination of both (whichever condition is satisfied first).
At time tx, all the events (E1, E2, and E3) generated between t0 and tx are processed together by the near-real-time processing job. Similarly, all the events (E4, E5, and E6) generated between tx and tn are processed together.
Figure 1.5 – Near-real-time processing
Typical near-real-time use cases are recommendation problems such as product recommendations for services such as Amazon or video recommendations for services such as YouTube and Netflix.
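As an illustration of micro-batching, here is a minimal sketch using Spark Structured Streaming's Java API with a 5-minute processing-time trigger. The Kafka topic, paths, and job name are hypothetical:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;
import org.apache.spark.sql.streaming.Trigger;

public class NearRealTimeJob {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("near-real-time-microbatch")
                .getOrCreate();

        // Read a continuous stream of events from a Kafka topic
        Dataset<Row> events = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "localhost:9092")
                .option("subscribe", "clickstream")
                .load()
                .selectExpr("CAST(value AS STRING) AS event");

        // Trigger a micro-batch every 5 minutes: all events that arrived
        // during the interval are processed together, as described above.
        StreamingQuery query = events.writeStream()
                .format("parquet")
                .option("path", "s3a://curated-zone/clickstream/")
                .option("checkpointLocation", "s3a://checkpoints/clickstream/")
                .trigger(Trigger.ProcessingTime("5 minutes"))
                .start();

        query.awaitTermination();
    }
}
```

Unlike the daily batch job shown earlier, this job runs continuously, but unlike a true real-time stream processor, it accumulates events and processes them in periodic micro-batches.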
Publishing problems deal with publishing the processed data to different businesses and teams so that data is easily available with proper security and data governance. Since the main goal of the publishing problem is to expose the data to a downstream system or an external application, having extremely robust data security and governance is essential.
Usually, in modern data architectures, data is published in one of three ways:
Let’s take a closer look at each.
Sorted data repositories is a common term used for various kinds of repositories that are used to store processed data. This is usable and actionable data and can be directly queried by businesses, analytics teams, and other downstream applications for their use cases. They are broadly divided into three types:
A data warehouse is a central repository of integrated and structured data that’s mainly used for reporting, data analysis, and Business Intelligence (BI). A data lake consists of structured and unstructured data, which is mainly used for data preparation, reporting, advanced analytics, data science, and Machine Learning (ML). A data hub is the central repository of trusted, governed, and shared data, which enables seamless data sharing between diverse endpoints and connects business applications to analytic structures such as data warehouses and data lakes.
Another publishing pattern is where data is published as a service, popularly known as Data as a Service. This publishing pattern has many advantages, as it enables security, immutability, and governance by design. Nowadays, as cloud technologies and GraphQL become popular, Data as a Service is getting a lot of traction in the industry.
The two popular mechanisms of publishing Data as a Service are as follows:
We will discuss these techniques in detail later in this book.
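In the meantime, here is a minimal sketch of what a REST-based Data-as-a-Service endpoint could look like using Spring Boot. The endpoint path, service name, and the canned response are hypothetical; a real service would query a curated data store:

```java
import java.util.Map;
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.PathVariable;
import org.springframework.web.bind.annotation.RestController;

// Processed data is exposed through a read-only REST endpoint instead of
// direct database access, so security and governance can be enforced at
// the service boundary.
@SpringBootApplication
@RestController
public class SalesDataService {

    @GetMapping("/api/v1/sales/{productId}")
    public Map<String, Object> dailySales(@PathVariable String productId) {
        // Hypothetical canned record for illustration; a real implementation
        // would look this up in the curated data repository.
        return Map.of("productId", productId, "totalAmount", 42.5);
    }

    public static void main(String[] args) {
        SpringApplication.run(SalesDataService.class, args);
    }
}
```

Because consumers only ever see the API, the underlying storage can be secured, audited, and even replaced without affecting downstream applications.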
There’s a popular saying: A picture is worth a thousand words. Visualization is a technique by which reports, analytics, and statistics about the data are captured visually in graphs and charts.
Visualization is helpful for businesses and leadership to understand, analyze, and get an overview of the data flowing in their business. This helps a lot in decision-making and business planning.
A few of the most common and popular visualization tools are as follows:
Important note
A Lucene index is a full-text inverted index. This index is extremely powerful and fast for text-based searches and is the core indexing technology behind most search engines. A Lucene index takes all the documents, splits them into words or tokens, and then maps each term to the documents that contain it.
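To make the idea of an inverted index concrete, here is a minimal sketch using Lucene's Java API. The index directory, field name, and document content are hypothetical:

```java
import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class LuceneSketch {
    public static void main(String[] args) throws Exception {
        StandardAnalyzer analyzer = new StandardAnalyzer();
        Directory dir = FSDirectory.open(Paths.get("lucene-index"));

        // Index: the document is tokenized into terms, and Lucene builds an
        // inverted index mapping each term to the documents that contain it.
        try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
            Document doc = new Document();
            doc.add(new TextField("content", "data engineering with java", Field.Store.YES));
            writer.addDocument(doc);
        }

        // Search: the query terms are looked up directly in the inverted index.
        try (DirectoryReader reader = DirectoryReader.open(dir)) {
            IndexSearcher searcher = new IndexSearcher(reader);
            Query query = new QueryParser("content", analyzer).parse("engineering");
            TopDocs hits = searcher.search(query, 10);
            System.out.println("Matching documents: " + hits.scoreDocs.length);
        }
    }
}
```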
While we have briefly discussed a few of the visualization tools available in the market, there are many other competitive alternatives available. Discussing data visualization in more depth is beyond the scope of this book.
So far, we have provided an overview of data engineering and the various types of data engineering problems. In the next section, we will explore what role a Java data architect plays in the data engineering landscape.
Data architects are senior technical leaders who map business requirements to technical requirements, envision technical solutions to solve business problems, and establish data standards and principles. Data architects play a unique role in that they understand both the business and the technology. They are like the Janus of business and technology: on one hand, they can see, understand, and communicate with the business; on the other, they do the same with technology. Data architects create processes that are used to plan, specify, enable, create, acquire, maintain, use, archive, retrieve, control, and purge data. According to DAMA's Data Management Body of Knowledge, a data architect provides a standard common business vocabulary, expresses strategic requirements, outlines high-level integrated designs to meet those requirements, and aligns with the enterprise strategy and related business architecture.
The following diagram shows the cross-cutting concerns that a data architect handles:
Figure 1.6 – Cross-cutting concerns of a data architect
The typical responsibilities of a Java data architect are as follows:
In the real world, a data architect is supposed to play a combination of three disparate roles, as shown in the following diagram:
Figure 1.7 – Multifaceted role of a data architect
Let’s look at these three architectural roles in more detail:
Let’s understand the difference between a data architect and a data engineer.
The data architect and data engineer are related roles. A data architect visualizes, conceptualizes, and creates the blueprint of the data engineering solution and framework, while the data engineer takes the blueprint and implements the solution.
Data architects are responsible for bringing order to the data chaos generated by enormous piles of business data. Each data analytics or data science team requires a data architect who can visualize and design the data framework to create clean, analyzed, managed, formatted, and secure data. This framework can then be used by data engineers, data analysts, and data scientists in their work.
Data architects face a lot of challenges in their day-to-day work. Here, we will focus on the main ones:
Let’s take a closer look.
A single data engineering problem can be solved in many ways. However, with the ever-evolving expectations of customers and the emergence of new technologies, choosing the correct architectural pattern has become more challenging. What is more interesting is that, with the changing technological landscape, the need for agility and extensibility in architecture has increased manyfold, both to avoid unnecessary costs and to keep the architecture sustainable over time.
One of the complex problems that a data architect needs to figure out is the technology stack. Even with a very well-architected solution, whether it flies or flops depends on the technology stack you choose and how you plan to use it. As more and more tools, technologies, databases, and frameworks are developed, choosing an optimal tech stack that can help create a scalable, reliable, and robust solution remains a big challenge for data architects. Often, a data architect also needs to take into account non-technical factors, such as the projected growth of a tool, the market availability of skilled resources for it, vendor lock-in, cost, and community support options.
Data governance is a buzzword in data businesses, but what does it mean? Governance is a broad area that includes both workflows and toolsets to govern data. If either the tools or the workflow process has limitations or is not present, then data governance is incomplete. When we talk about actionable governance, we mean the following elements:
Data governance should always be aligned with strategic and organizational goals.
Creating an optimal architecture with the correct set of tools is a challenging task, but it is never enough unless it is put into practice. One of the hats that a data architect often needs to wear is that of a sales executive who must sell their solution to business executives or upper leadership. These are usually not technical people, and they don't have a lot of time. Data architects, most of whom have strong technical backgrounds, face the daunting task of communicating and selling their ideas to these people. To convince them about the opportunity and the idea, a data architect needs to back it up with proper decision metrics and information that align that opportunity with the broader business goals of the organization.
So far, we have seen the role of a data architect and the common problems that they face. In the next section, we will provide an overview of how a data architect mitigates those challenges on a day-to-day basis.
In this section, we will discuss how a data architect can mitigate the aforementioned challenges. To understand the mitigation plan, it is important to understand what the life cycle of a data architecture looks like and how a data architect contributes to it. The following diagram shows the life cycle of a data architecture:
Figure 1.8 – Life cycle of a data architecture
The data architecture starts with defining the problem that the business is facing, which is mainly identified or reported by business teams or customers. Then, the data architects work closely with the business to define the business requirements. However, in a data engineering landscape, that is not enough: in many cases, there are hidden requirements or anomalies. To uncover them, business analysts team up with data architects to analyze the data and the current state of the system, including any existing solution, the current cost or loss of revenue due to the problem, and the infrastructure where the data resides. This helps refine the business requirements. Once the business requirements are more or less frozen, the data architects map them to technical requirements.
Then, the data architect defines the standards and principles of the architecture and determines the priorities of the architecture based on the business need and budget. After that, the data architect creates the most suitable architectures, along with their proposed technology stacks. In this phase, the data architects work closely with the data engineers to implement proofs of concept (POCs) and evaluate the proposed solution in terms of feasibility, scalability, and performance.
Finally, the architects recommend solutions based on the evaluation results and architectural priorities defined earlier. The data architects present the proposed solutions to the business. Based on priorities such as cost, timeline, operational cost, and resource availability, feedback is received from the business and clients. It takes a few iterations to solidify and get an agreement on the architecture.
Once an agreement has been reached, the solution is implemented. Based on the implementation challenges and particular use cases, the architecture may be revised or tweaked a little. Once an architecture is implemented and goes to production, it enters the maintenance and operations phase. During maintenance and operations, feedback is sometimes provided, which might result in a few architectural improvements and changes, but these are seldom needed if the solution is well-architected in the first place.
In the preceding diagram, the blue boxes indicate major involvement from a customer, a green box indicates major involvement from a data architect, a yellow box means a data architect equally shares involvement with another stakeholder, and a gray box means the data architect has the least involvement in that scenario.
Now that we have understood the life cycle of the data architecture and a data architect’s role in various phases, we will focus on how to mitigate those challenges that are faced by a data architect. This book covers how to mitigate those challenges in the following way:
In this section, we discussed how this book can help an architect overcome the various challenges they will face and make them more effective in their role. Now, let’s summarize this chapter.
In this chapter, we learned what data engineering is and looked at a few practical examples of data engineering. Then, we covered the basics of data engineering, including the dimensions of data and the kinds of problems that are solved by data engineers. We also provided a high-level overview of various kinds of processing problems and publishing problems in a data engineering landscape. Then, we discussed the roles and responsibilities of a data architect and the kind of challenges they face. We also briefly covered the way this book will guide you to overcome challenges and dilemmas faced by a data architect and help you become a better Java data architect.
Now that you understand the basic landscape of data engineering and what this book will focus on, in the next chapter, we will walk through various data formats, data storage options, and databases and learn how to choose one for the problem at hand.