You should know that no matter what your application does, you will end up dealing with persistence sooner or later. Whether it's a payment, a post on social media, or anything else, information has no value if it's not stored, retrieved, aggregated, modified, and so on.
For this reason, data is very much a point of concern when designing an application. Wrong modeling (as we saw in Chapter 4, Best Practices for Design and Development, when talking about Domain-Driven Design) can lead to a weak application that is hard to develop and maintain.
In this chapter, we are taking data modeling a step further and discussing the ways your objects and values can be stored (also known as data at rest, as opposed to data in motion, where objects are still being actively manipulated by your application code).
In this chapter, we will cover the following topics:
As we have seen with many topics in this book so far, data persistence has also evolved a lot. Similar to what happened with software development models and the Java Enterprise Edition (JEE) framework, when we deal with data, we have many different options to choose from, each suited to particular use cases.
However, just as we have seen elsewhere (namely, in JEE applications versus cloud-native alternatives), the old ways have not been abandoned (because they are still relevant in some cases); instead, they are being complemented by more modern approaches that are suited for other use cases. And this is exactly what happened with the first technology that we are going to discuss – relational databases.
Relational databases are hardly a new idea; the model was first introduced by Edgar F. Codd in 1970. Omitting the mathematical concepts behind it (for brevity), the relational model arranges data into tables (we had a quick look at this in Chapter 7, Exploring Middleware and Frameworks, in the Persistence section).
Roughly speaking, each table can be seen as one of the objects in our business model, with the columns mapping to the object fields and the rows (also known as records) representing the different object instances.
In the following sections, we are going to review the basics of relational databases, starting with keys and relationships, the concept of transactionality, and stored procedures.
Regardless of the specific database technology, it's common to have a way to identify each row. This is usually done by identifying a field (or a set of fields) that is unique to each record. This is the concept of a primary key. Primary keys can be considered a constraint, meaning that they represent rules with which the data inserted into the table must comply. Those rules need to be maintained for the table (and its records) to stay in a valid state (in this case, by having each record associated with a unique ID). Other constraints are also commonly implemented in a relational database; depending on the specific database system, these constraints can express quite complex validation rules.
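To make the idea concrete, here is a toy in-memory sketch of a primary key constraint (class and method names are invented for illustration; a real database enforces this internally, with far more machinery):

```java
import java.util.HashMap;
import java.util.Map;

// A toy "table" that enforces a primary key constraint on insert.
// This is a conceptual sketch, not a real database implementation.
public class PrimaryKeyTable {
    private final Map<Long, String> rows = new HashMap<>();

    // Insert fails (returns false) if the primary key is already taken,
    // keeping the table in a valid state.
    public boolean insert(long id, String value) {
        if (rows.containsKey(id)) {
            return false; // constraint violation: duplicate primary key
        }
        rows.put(id, value);
        return true;
    }

    public String findById(long id) {
        return rows.get(id);
    }

    public static void main(String[] args) {
        PrimaryKeyTable table = new PrimaryKeyTable();
        System.out.println(table.insert(1L, "first"));  // true
        System.out.println(table.insert(1L, "other"));  // false, duplicate key
    }
}
```

A real database would reject the second insert with an error rather than returning a flag, but the invariant being protected is the same.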
Another core concept of the database world is the concept of relations. This is, as you can imagine, a way to model links between different objects (similar to what happens in the world of Object-Oriented Programming (OOP), where an object can contain references to other objects). The relations can fall into one of the following three cardinalities:
Here is a diagram of the three types of relationship cardinalities:
As you can see in the preceding diagram, there is a graphical representation of three examples of relationships:
These relationships are nothing new; the same concepts apply to Java objects, which can model the same kinds of relationships:
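As a minimal sketch (all class and field names here are invented for illustration), the three cardinalities map naturally to plain Java references and collections:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain classes showing how the three relationship
// cardinalities map to plain Java objects.

// One-to-one: a person holds a reference to exactly one passport.
class Passport { String number; }

// One-to-many: a person holds a collection of cars.
class Car { String plate; }

class Person {
    Passport passport;                  // one-to-one
    List<Car> cars = new ArrayList<>(); // one-to-many
}

// Many-to-many: students and courses reference each other through collections.
class Student { List<Course> courses = new ArrayList<>(); }
class Course  { List<Student> students = new ArrayList<>(); }

public class Relationships {
    public static void main(String[] args) {
        Person person = new Person();
        person.passport = new Passport();
        person.cars.add(new Car());
        person.cars.add(new Car());
        System.out.println(person.cars.size()); // 2
    }
}
```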
All of those models can then be propagated into SQL databases, and this is indeed done by JPA, which we introduced in Chapter 7, Exploring Middleware and Frameworks.
It used to be common (and it still happens in some cases) to define the domain model of an application, starting with the design of the database that will store the data. It's quite a simplistic approach since it cannot easily model every aspect of object-oriented applications (such as inheritance, interfaces, and many other constructs), but it works for some simple scenarios.
One of the more interesting (and widely used) capabilities of a relational database is related to transactionality. Transactionality refers to a set of characteristics of relational databases that are the basis for maintaining data integrity (especially in the case of failures). These characteristics are united under the ACID acronym (which stands for Atomicity, Consistency, Isolation, and Durability):
Tip
Consider that the concept of transactionality is usually not very well suited to heavily distributed environments, such as microservices and cloud-native architecture. We will discuss this more in Chapter 9, Designing Cloud-Native Architectures.
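To make the atomicity idea concrete, here is a toy sketch: changes are buffered and either all applied (commit) or all discarded (rollback). Real databases implement this with write-ahead logs and locks; this in-memory version (with invented names) only illustrates the all-or-nothing behavior:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of atomicity: writes are buffered in a pending map
// and become visible only on commit; rollback discards them all.
public class TinyTransaction {
    private final Map<String, String> store = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();

    // Buffer a write; it is not visible to readers yet.
    public void write(String key, String value) { pending.put(key, value); }

    // Apply all buffered writes at once.
    public void commit() { store.putAll(pending); pending.clear(); }

    // Discard all buffered writes.
    public void rollback() { pending.clear(); }

    public String read(String key) { return store.get(key); }
}
```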
Last but not least, many different technologies allow us to execute custom code directly on the database.
Many widely used databases can run complex programs. There is no standard for this kind of feature, although the languages used are often extensions of SQL, adding conditions, loops, and similar statements. Occasionally, general-purpose languages (such as Java and .NET) are also available on some database systems.
The reason for storing code in a database is mainly data locality. By executing code in a database, the system has complete control over execution and transactional behavior (such as locking mechanisms); hence, you may end up getting very good performance. This may be particularly useful if you are doing batch operations and calculations on a large amount of data. But if you ask me, the advantages stop here, and they are rarely worth the trade-offs.
When using stored procedures on a database system, you will observe small performance improvements, but the overall solution will be ugly from an architectural point of view and hard to maintain. Putting business logic in the data layer is never a good idea from a design point of view, and using special, procedural languages (such as the ones often available on such platforms) can only make things worse. Moreover, such languages are almost always impossible to port from one database system to another, hence strongly coupling your application with a specific technology and making it hard to change technology if needed.
Tip
Unless they are really needed, I advise avoiding stored procedures.
Now that we have seen a summary of the basic features, let's see the commonly used implementations of relational databases.
Let's quickly discuss some commonly used products that provide the relational database features we have seen so far:
This includes scenarios such as embedding the database as part of a development pipeline or a Maven task, when it can be programmatically destroyed, created, and launched any time you want. This makes it particularly useful in testing scenarios. Despite more complex setups being available (such as client-server), H2 is usually considered unsuitable for production usage. The most common use case, other than testing and development, is to ship it embedded with applications in order to provide a demo mode when an application is first started, with the suggestion that a different database be set up before going into production.
Now that we have seen a brief selection of commonly used databases, let's have a look at the use cases where it's beneficial to use a relational database and when other options would be better.
Transactionality is the key feature of relational databases and is one of the advantages of using the technology. While other storage technologies can be configured to offer features similar to ACID transactions, if you need to reliably store structured data consistently, it's likely that a relational database is your best bet, both from a performance and a functionality standpoint. Moreover, through the SQL language, databases offer an expressive way to retrieve, combine, and manipulate data, which is critical for many use cases.
Of course, there are downsides too. A database needs a rigid structure to be defined upfront for tables, relations, and constraints (that's pretty much essential and inherent to the technology). Later changes are of course possible, but they can have a lot of side effects (typically in terms of performance and potential constraint violations), and for this reason, they are impactful and expensive. On the other hand, we will see that alternative technologies (such as NoSQL storage) can implement changes in the data structure more easily.
For this reason, a relational database may not be suitable in cases where we don't exactly know the shape of the data objects we are going to store. Another potential issue is that, given the complexity and rigidity of the technology, you may end up with performance and functional issues, which are not always easy to troubleshoot.
A typical example relates to complex queries. A relational database typically uses indexes to achieve better performance (each specific implementation may use different techniques, but the core concepts are often the same). Indexes must be maintained over time, with operations such as defragmentation (and other similar ones, depending on the specific database implementation). If we fail to perform such maintenance properly, performance may be heavily impacted. And even if our indexes are working correctly, complex queries may still perform poorly.
This is because, in most practical implementations, you will need to combine and filter data from many different tables (an operation generally known as a join). These operations may be interpreted in many different ways by databases that will try to optimize the query times but will not guarantee good results in every case (especially when many tables and rows are involved).
Moreover, when doing complex queries, you may end up not correctly using the indexes, and small changes in a working query may put you in the same situation. For this reason, my suggestion is, in complex application environments, to make sure to always double-check your queries in advance with the database administrators, who are likely to have tools and experience for identifying potential issues before they slip into production environments.
As we have seen in this section, relational databases, while not being the most modern option, are still a very widespread and useful technology for storing data, especially when you have requirements regarding data integrity and structure. However, this comes at the cost of defining the data structure upfront and exercising some discipline in the maintenance and usage of the database.
You should also consider that, sometimes, relational databases may simply be overkill for simple use cases, where you just need simple queries and maybe not even persistence. We are going to discuss this scenario in the next section.
There are scenarios in which you simply need temporary storage and will access it in a simple way, such as by a known unique key associated with your object. This scenario is a perfect fit for key/value stores. Under this umbrella, you can find a lot of different implementations, which usually share some common features. The most basic one is the access model – almost every key/value store provides APIs for retrieving data by key. This is essentially the same mechanism as hash tables in Java, which guarantees very fast access. Data stored in this way can be serialized in many different formats. The most basic choice, for simple values, is strings, but Protobuf is another common choice (see Chapter 8, Designing Application Integration and Business Automation, where we discussed this and other serialization technologies).
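The access model can be sketched in a few lines (an in-process map with invented names standing in for a real product such as a remote key/value server; values are plain strings, the most basic serialization choice):

```java
import java.util.concurrent.ConcurrentHashMap;

// A minimal in-memory key/value store exposing the typical primitives:
// get, put, and delete. Real products add networking, persistence,
// and clustering on top of essentially this access model.
public class KeyValueStore {
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();

    public void put(String key, String value) { data.put(key, value); }

    public String get(String key) { return data.get(key); }

    public void delete(String key) { data.remove(key); }
}
```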
A key/value store may not offer persistent storage options, as that is not the typical use case. Data is simply kept in memory to be optimized for performance. Modern implementations, however, compromise by serializing data on disk or in an external store (such as a relational database). This is commonly done asynchronously to reduce the impact on access and save times.
Whether the technology you are using is providing persistent storage or not, there are other features for enhancing the reliability of a system. The most common one is based on data replication. Basically, you will have more than one system (also called nodes) running in a clustered way (meaning that they are talking to each other). Such nodes may be running on the same machine or, better yet, in different locations (to increase the reliability even more).
Then, the technology running your key/value store may be configured to propagate each change (adding, removing, or modifying data) into a number of different nodes (optionally, all of them). In this way, in case of the failure of a node, your data will still be present in one or more other nodes. This replication can be done synchronously (reducing the possibility of data loss but increasing the latency of each write operation) or asynchronously (the other way around).
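As a rough sketch of the synchronous variant (in-process maps with invented names standing in for real nodes), every write is propagated to all nodes before the call returns, so a read can be served by any surviving node:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of synchronous replication: a write returns only after all
// nodes have applied it, trading write latency for durability.
public class ReplicatedStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public ReplicatedStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // Synchronous write: all nodes are updated before returning.
    public void put(String key, String value) {
        for (Map<String, String> node : nodes) {
            node.put(key, value);
        }
    }

    // After a node failure, the data is still available on the others.
    public String getFromNode(int nodeIndex, String key) {
        return nodes.get(nodeIndex).get(key);
    }
}
```

An asynchronous variant would return after updating one node and propagate the change in the background, lowering latency but opening a window for data loss, as described above.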
In the upcoming sections, we are going to see some common scenarios relating to caching data and the life cycle of records stored in the key/value store. Let's start looking at some techniques to implement data caching.
A typical use case for key/value stores is caching. You can use a cache as a centralized location to store disposable data that's quickly accessible from your applications. Such data is typically considered disposable because it can be retrieved in other ways (such as from a relational database) if the key/value store is unavailable or doesn't have the data.
So, in an average case (sometimes referred to as a cache hit), you will have better performance and will avoid going into other storage (such as relational databases), which may be slow, overloaded, or expensive to access. In a worst-case scenario (sometimes referred to as a cache miss), you will still have other ways to access your data.
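The hit/miss logic above is often called the cache-aside pattern. A minimal sketch (the loader function, with an invented name, stands in for the slower backing store, such as a relational database):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside sketch: look in the cache first (cache hit); on a miss,
// fall back to the slower backing store and populate the cache.
public class CacheAside {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> backingStore;
    public int misses = 0; // exposed only to make the behavior observable

    public CacheAside(Function<String, String> backingStore) {
        this.backingStore = backingStore;
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                       // cache hit: fast path
        }
        misses++;
        String loaded = backingStore.apply(key); // cache miss: slow path
        cache.put(key, loaded);                  // populate for next time
        return loaded;
    }
}
```

After the first lookup of a given key, subsequent lookups are served entirely from the cache, which is where the performance benefit comes from.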
Some common scenarios are as follows:
This scenario can be managed by notifying the key/value store about any change occurring in the persistent storage. This can be done by the application writing data, or it can be done directly by the persistent storage (if it is a feature provided by the technology) using a pattern known as change data capture. The key/value store may then decide to update the changed data or simply delete it from the cached view (forcing a retrieve from the persistent store when your application will look again for the same key).
Another common topic when talking about key/value stores is the life cycle of the data.
Since key/value stores use memory heavily, with huge datasets you may want to avoid keeping everything in memory, especially if the access patterns are identifiable (for example, if you can foresee with reasonable accuracy which data will be accessed by your application). Common patterns for deciding what to keep in memory and what to delete are as follows:
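One widely used eviction pattern is least recently used (LRU): when capacity is exceeded, the entry that has gone unaccessed the longest is dropped. In Java, a compact sketch is possible by using LinkedHashMap in access-order mode:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU eviction sketch: LinkedHashMap's access-order mode keeps entries
// ordered by last access, and removeEldestEntry evicts the stalest one
// whenever the configured capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict when over capacity
    }
}
```

Production-grade caches implement the same idea (plus variants such as least frequently used or time-to-live expiration) with more sophisticated, concurrent data structures.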
Key/value stores lack a standard language, such as SQL. It's also for this reason that key/value stores are a big family, including many different products and libraries, often offering more features than just key/value management. In the next section, we are going to see a few of the most famous implementations of key/value stores.
As previously mentioned, it's not easy to build a list of key/value store technology implementations. As we will see in the next few sections, this way of operating a database is considered to be a subcategory of a bigger family of storage systems, called NoSQL databases, offering more options and alternatives than just key/value storage. However, for the purpose of this section, let's see a list of what is commonly used in terms of key/value stores:
Now that we have seen some widespread key/value stores, let's see when they are a good fit and when they are not.
The most important advantage of key/value stores is the performance. The access time can be incredibly fast, especially when used without any persistent storage (in-memory only). This makes them particularly suitable for low-latency applications. Another advantage is simplicity, both from an architectural and a usage point of view.
Architecturally speaking, if your use case doesn't require clustering and other complex settings, a key/value store can be as simple as a single application exposing an API to retrieve and store records. From a usage point of view, most use cases can be implemented with primitives as simple as get, put, and delete. However, some of these points can become limitations of key/value stores, especially when you have different requirements. If your application needs to be reliable (as in losing as little data as possible when there's a failure), you may end up with complex multi-node setups and persistence techniques. This may, in turn, mean that in some cases, you can have inconsistency in data that may need to be managed from an application point of view.
Another common issue is that, usually, data is not structured in key/value stores. This means that it is only possible to retrieve data searching by key (or at least, that's the most appropriate scenario). While some implementations allow it, it can be hard, performance-intensive, or in some cases impossible to retrieve data with complex queries on the object values, in contrast with what you can do with SQL in relational databases.
In this section, we have covered the basics of data caching and key/value stores. Such techniques are increasingly used in enterprise environments, both for their positive impact on performance and for their scalability, which fits well with cloud-native architectures. Topics such as caching techniques and the life cycle of objects are common considerations when adopting key/value stores.
Key/value stores are considered to be part of a broader family of storage technologies that are alternatives to relational databases, called NoSQL. In the next section, we will go into more detail about this technology.
NoSQL is an umbrella term comprising a number of very different data storage technologies. The term was coined mostly for marketing purposes, in order to distinguish them from relational databases; some NoSQL databases even support SQL-like query languages. NoSQL databases claim to outdo relational databases in terms of performance. However, this advantage comes at the price of some compromises, usually in terms of transactionality and reliability. To discuss these limitations, it is worth having an overview of the CAP theorem.
The CAP theorem was theorized by Eric Brewer in 1998 and formally proven valid in 2002 by Seth Gilbert and Nancy Lynch. It refers to a distributed data store, regardless of the underlying technology, so it's also applicable to relational databases when instantiated in a multi-server setup (so, running in two or more different processes, communicating through a network, for clustering and high-availability purposes). The theorem focuses on the concept of a network split, when the system becomes partitioned into two (or more) subsets that are unable to communicate with each other due to connectivity loss.
The CAP theorem describes three core characteristics of distributed data stores:
The CAP theorem states that, when a partition occurs, you can only preserve consistency or availability. While a mathematical explanation is available (and beyond the scope of this book), the underlying idea can be understood easily:
However, it's worth noting that this theorem, while being the basis for understanding the limits of distributed data stores, must be contextualized in each particular scenario. In many enterprise contexts, it is possible to make the event of a network split extremely unlikely (for example, by providing multiple network connections between each server).
Moreover, it's common to have mechanisms to elect a primary partition when there's a network split. This basically means that if you are able to define which part of the cluster is primary (typically, the one with the greater number of survival nodes, and this is why it's usually recommended to have an odd number of nodes), this partition can keep working as usual, while the remaining partition can shut down or switch to a degraded mode (such as read-only). So, basically, it's crucial to understand the basics of the CAP theorem, but it's also important to understand that there are a number of ways to work around the consequences.
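The primary-partition rule sketched above boils down to a simple majority check (a deliberate simplification; real systems use full consensus protocols, but the arithmetic below is why an odd number of nodes is recommended):

```java
// A partition may keep serving traffic only if it can reach a strict
// majority of the cluster's nodes. With an even total, a symmetric
// split leaves NO side with a majority, which is why odd-sized
// clusters are usually recommended.
public class Quorum {
    public static boolean hasMajority(int totalNodes, int reachableNodes) {
        return reachableNodes > totalNodes / 2;
    }

    public static void main(String[] args) {
        System.out.println(hasMajority(5, 3)); // true: primary partition
        System.out.println(hasMajority(5, 2)); // false: degraded/read-only
        System.out.println(hasMajority(4, 2)); // false: even split, no primary
    }
}
```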
This is exactly the reasoning behind NoSQL databases. These databases shift their point of view, stretching the CAP capabilities a bit. While traditional relational databases focus on consistency and availability, they are often hard to operate reliably in a heavily distributed fashion. Conversely, NoSQL databases can operate better in horizontally distributed architectures, favoring scalability, throughput, and performance at the expense of availability (as we saw, becoming read-only when there are network partitions) or consistency (not providing ACID transaction capabilities).
And this brings us to another common trait of NoSQL stores – eventual consistency.
Indeed, most NoSQL stores, while not providing full transactionality (compared to relational databases), can still offer some data integrity by using the pattern of eventual consistency. Digging into the details and impacts of this pattern would take a lot of time; for the sake of this section, it's sufficient to know that a system implementing eventual consistency may have periods of time in which data is not coherent (in particular, querying the same data on two different nodes can lead to two different results).
With that said, it's usually possible to tune a NoSQL store in order to preserve consistency and provide full transactionality as a traditional relational database does. But in my personal experience, the impacts in terms of reduced performance and availability are not a worthwhile compromise. In other words, if you are looking for transactionality and data consistency, it's usually better to rely on relational databases.
With that said, let's have an overview of the different NoSQL database categories.
As we discussed in the previous sections, NoSQL is an umbrella term. There are a number of different categories of NoSQL stores:
Of course, as you can imagine, there is a lot more to say about NoSQL databases. I hope the pointers I gave in this section will help you quickly understand the major features of NoSQL databases, and I hope one of the examples I've provided will be useful for your software architecture. In the next section, we are going to have a look at filesystem storage.
Filesystems are a bit of a borderline concept when it comes to data storage systems. To be clear, filesystem storage is barely structured: it does not provide the APIs, schemas, and advanced features of the other storage systems that we have seen so far. However, it is still a very relevant layer in many applications, and there are some newer storage infrastructures that provide advanced features, so I think it's worth having a quick overview of some core concepts.
Filesystem storage should not be an alien concept to most of us. It is a persistent storage system backed by specific hardware (spinning or solid-state disks). There are many different filesystems, which can be considered the protocol used to abstract the read and write operations from and to such specific hardware. Other than creating, updating, and deleting files, and the arrangement of these files into folders, filesystems can provide other advanced features, such as journaling (to reduce the risk of data corruption) and locking (in order to provide exclusive access to files).
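From Java, these basic operations are available through the java.nio.file and java.nio.channels APIs. A small sketch (assuming Java 11+ for Files.writeString/readString; the file name is arbitrary) showing a write, an exclusive file lock, and a read back:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Basic filesystem operations: write a file, briefly hold an exclusive
// lock on it (as provided by the underlying filesystem), and read it back.
public class FileBasics {
    public static String writeAndRead(Path file, String content) throws IOException {
        Files.writeString(file, content);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            // While the lock is held, other processes are denied access
            // to the locked region (semantics vary by OS and filesystem).
        }
        return Files.readString(file);
    }

    // Convenience wrapper using a temporary file, cleaned up afterward.
    public static String roundTrip(String content) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        try {
            return writeAndRead(tmp, content);
        } finally {
            Files.delete(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello")); // hello
    }
}
```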
Some common filesystems are the New Technology File System (NTFS), used in Windows environments, and the Extended File System (ext), used in Linux environments. However, these filesystems are designed for working on a single machine. More interesting for our purposes are the filesystems that allow interaction between different systems. One widespread implementation is the networked filesystem, a family of filesystem protocols providing access to files and directories over a network. The most notable example here is NFS, a protocol that provides multi-server access to a shared filesystem. The File Transfer Protocol (FTP) and the SSH File Transfer Protocol (SFTP) are other famous examples; even though they are somewhat dated, they are still widely used.
A recent addition to the family of network storage systems is Amazon S3. While it's technically an object store, it offers a way to interact with Amazon's storage facilities using APIs in order to store and retrieve files. It started as a proprietary implementation providing storage services on AWS infrastructure over the internet; since then, the S3 API has become a de facto standard, and there are a lot of other implementations, both open source and commercial, aiming to provide S3-compatible storage on-premises and in the cloud.
It's hard to talk about the disadvantages of filesystems because they are an essential requirement in every application, and it will stay like this for a long time. However, it's important to contextualize and think logically about the pros and cons of filesystems to better understand where to use them.
Application interaction over shared filesystems is particularly convenient when it comes to exchanging large amounts of data. In banking systems (especially legacy ones), it's common to exchange large numbers of operations (such as payments) to be performed in batches, in the form of huge .csv files. The advantage is that the files can be safely chunked, signed, and efficiently transferred over a network.
On the other hand, filesystems don't usually offer native indexing and full-text search, so these capabilities must be implemented on top. Moreover, filesystems (especially networked filesystems) can perform badly, especially when it comes to concurrent access and the locking of files.
With this section, we have completed our overview of storage systems.
In the next section, we are going to see how, in modern architecture, it is common to use more than one storage solution to address different use cases with the most suitable technology.
In the final section of this chapter, we'll explore a concept that may seem obvious but is still worth mentioning. Modern architecture tends to use multiple data storage solutions, and I think this is a particularly interesting approach.
In the past, it was common to start by defining a persistence strategy (typically on a relational database or another legacy persistence system) and build the application's functionality around it. This is no longer the case. Cloud-native technologies, through microservices, developed the idea that each microservice should own its own data, and we can extend this concept: each microservice could choose its own persistent storage technology, the one best suited to the particular characteristics of that business domain and the related use cases. Some services may need to focus on performance, while others will have a strong need for transactionality and data consistency.
However, even if you are dealing with a less innovative architecture, it's still worthwhile evaluating different ideas around data persistence solutions. Here are some discussion points about it:
So, once again, choosing the right data storage technology can be crucial to have a performant and well-written application, and it's a common practice to rely on more than one technology to meet the different needs that different parts of our application will require.
In this chapter, we have seen an overview of different possibilities on the data layer, ranging from traditional SQL databases to more modern alternatives.
While most of us are already familiar with relational databases, we have had a useful examination of the pros and cons of using this technology. We then broadened our view with alternative, widespread storage technologies, such as key/value stores, NoSQL, and even filesystems.
Eventually, we looked at how the choice of a particular way of storing data may affect both the application design and the performance of our system. Indeed, in modern architecture, we may want to pick the right storage solution for each use case by choosing different solutions where needed.
In the next chapter, we are going to discuss some architectural cross-cutting concerns. Topics such as security, resilience, usability, and observability are crucial to successful application architecture and will be analyzed to see their impacts and best practices.