You should know that no matter what your application does, you will end up dealing with persistence sooner or later. Whether it's a payment, a post on social media, or anything else, information has no value if it's not stored, retrieved, aggregated, modified, and so on.
For this reason, data is very much a point of concern when designing an application. Wrong modeling (as we saw in Chapter 4, Best Practices for Design and Development, when talking about Domain-Driven Design) can lead to a weak application that is hard to develop and maintain.
In this chapter, we are taking data modeling a step further and discussing the ways your objects and values can be stored (also known as data at rest, as opposed to data in motion, where objects are still being actively manipulated by your application code).
In this chapter, we will cover the following topics:
As we have seen with many topics in this book so far, data persistence has also evolved a lot. Similar to what happened with software development models and the Java Enterprise Edition (JEE) framework, when we deal with data, we have many different options to choose from, each suited to particular use cases.
However, just as we have seen elsewhere (namely, in JEE applications versus cloud-native alternatives), the old ways have not been abandoned (because they are still relevant in some cases); instead, they are being complemented by more modern approaches that are suited for other use cases. And this is exactly what happened with the first technology that we are going to discuss – relational databases.
Relational databases are hardly a new idea; the model was first introduced by Edgar F. Codd in 1970. Omitting the mathematical concepts behind it (for brevity), the relational model arranges data into tables (we had a quick look at this in Chapter 7, Exploring Middleware and Frameworks, in the Persistence section).
Roughly speaking, each table can be seen as one of the objects in our business model, with the columns mapping to the object fields and the rows (also known as records) representing the different object instances.
In the following sections, we are going to review the basics of relational databases, starting with keys and relationships, the concept of transactionality, and stored procedures.
Regardless of the specific database technology, it's common to have a way to identify each row. This is usually done by identifying a field (or a set of fields) that is unique to each record. This is the concept of a primary key. Primary keys can be considered a constraint, meaning that they represent rules with which the data inserted into the table must comply. Those rules need to be maintained for the table (and its records) to stay in a valid state (in this case, by having each record associated with a unique ID). Other constraints are also commonly implemented in a relational database; depending on the specific database system, these constraints can express quite complex validation rules.
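To make the idea concrete, here is a toy in-memory sketch of a primary key constraint (class and method names are invented for illustration; a real database enforces this internally, with far more machinery):

```java
import java.util.HashMap;
import java.util.Map;

// A toy "table" that enforces a primary key constraint on insert.
// This is a conceptual sketch, not a real database implementation.
public class PrimaryKeyTable {
    private final Map<Long, String> rows = new HashMap<>();

    // Insert fails (returns false) if the primary key is already taken,
    // keeping the table in a valid state.
    public boolean insert(long id, String value) {
        if (rows.containsKey(id)) {
            return false; // constraint violation: duplicate primary key
        }
        rows.put(id, value);
        return true;
    }

    public String findById(long id) {
        return rows.get(id);
    }

    public static void main(String[] args) {
        PrimaryKeyTable table = new PrimaryKeyTable();
        System.out.println(table.insert(1L, "first"));  // true
        System.out.println(table.insert(1L, "other"));  // false, duplicate key
    }
}
```

A real database would reject the second insert with an error rather than returning a flag, but the invariant being protected is the same.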
Another core concept of the database world is the concept of relations. This is, as you can imagine, a way to model links between different objects (similar to what happens in the world of Object-Oriented Programming (OOP), where an object can contain references to other objects). The relations can fall into one of the following three cardinalities:
Here is a diagram of the three types of relationship cardinalities:
As you can see in the preceding diagram, there is a graphical representation of three examples of relationships:
These relationships are nothing new; the same concepts apply to Java objects, which can model the same kinds of relationships:
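As a minimal sketch (all class and field names here are invented for illustration), the three cardinalities map naturally to plain Java references and collections:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical domain classes showing how the three relationship
// cardinalities map to plain Java objects.

// One-to-one: a person holds a reference to exactly one passport.
class Passport { String number; }

// One-to-many: a person holds a collection of cars.
class Car { String plate; }

class Person {
    Passport passport;                  // one-to-one
    List<Car> cars = new ArrayList<>(); // one-to-many
}

// Many-to-many: students and courses reference each other through collections.
class Student { List<Course> courses = new ArrayList<>(); }
class Course  { List<Student> students = new ArrayList<>(); }

public class Relationships {
    public static void main(String[] args) {
        Person person = new Person();
        person.passport = new Passport();
        person.cars.add(new Car());
        person.cars.add(new Car());
        System.out.println(person.cars.size()); // 2
    }
}
```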
All of those models can then be propagated into SQL databases, and this is indeed done by JPA, which we introduced in Chapter 7, Exploring Middleware and Frameworks.
It used to be common (and it still happens in some cases) to define the domain model of an application, starting with the design of the database that will store the data. It's quite a simplistic approach since it cannot easily model every aspect of object-oriented applications (such as inheritance, interfaces, and many other constructs), but it works for some simple scenarios.
One of the more interesting (and widely used) capabilities of a relational database is related to transactionality. Transactionality refers to a set of characteristics of relational databases that are the basis for maintaining data integrity (especially in the case of failures). These characteristics are united under the ACID acronym (which stands for Atomicity, Consistency, Isolation, and Durability):
Tip
Consider that the concept of transactionality is usually not very well suited to heavily distributed environments, such as microservices and cloud-native architecture. We will discuss this more in Chapter 9, Designing Cloud-Native Architectures.
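To make the atomicity idea concrete, here is a toy sketch: changes are buffered and either all applied (commit) or all discarded (rollback). Real databases implement this with write-ahead logs and locks; this in-memory version (with invented names) only illustrates the all-or-nothing behavior:

```java
import java.util.HashMap;
import java.util.Map;

// Toy illustration of atomicity: writes are buffered in a pending map
// and become visible only on commit; rollback discards them all.
public class TinyTransaction {
    private final Map<String, String> store = new HashMap<>();
    private final Map<String, String> pending = new HashMap<>();

    // Buffer a write; it is not visible to readers yet.
    public void write(String key, String value) { pending.put(key, value); }

    // Apply all buffered writes at once.
    public void commit() { store.putAll(pending); pending.clear(); }

    // Discard all buffered writes.
    public void rollback() { pending.clear(); }

    public String read(String key) { return store.get(key); }
}
```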
Last but not least, many different technologies allow us to execute custom code directly on the database.
Many widely used databases can run complex programs. There is no standard for this kind of feature, although the languages used are often extensions of SQL, adding conditions, loops, and similar statements. Occasionally, general-purpose languages (such as Java and .NET) are also available on some database systems.
The reason for storing code in a database is mainly data locality. By executing code in a database, the system has complete control over execution and transactional behavior (such as locking mechanisms); hence, you may end up getting very good performance. This may be particularly useful if you are doing batch operations and calculations on a large amount of data. But if you ask me, the advantages stop here, and they are rarely worth the trade-offs.
When using stored procedures on a database system, you will observe small performance improvements, but the overall solution will be ugly from an architectural point of view and hard to maintain. Putting business logic in the data layer is never a good idea from a design point of view, and using special, procedural languages (such as the ones often available on such platforms) can only make things worse. Moreover, such languages are almost always impossible to port from one database system to another, hence strongly coupling your application with a specific technology and making it hard to change technology if needed.
Tip
Unless they are really needed, I advise avoiding stored procedures.
Now that we have seen a summary of the basic features, let's see the commonly used implementations of relational databases.
Let's quickly discuss some commonly used products that provide the relational database features we have seen so far:
This includes scenarios such as embedding the database as part of a development pipeline or a Maven task, when it can be programmatically destroyed, created, and launched any time you want. This makes it particularly useful in testing scenarios. Despite more complex setups being available (such as client-server), H2 is usually considered unsuitable for production usage. The most common use case, other than testing and development, is to ship it embedded with applications in order to provide a demo mode when an application is first started, with the suggestion that a different database be set up before going into production.
Now that we have seen a brief selection of commonly used databases, let's have a look at the use cases where it's beneficial to use a relational database and when other options would be better.
Transactionality is the key feature of relational databases and is one of the advantages of using the technology. While other storage technologies can be configured to offer features similar to ACID transactions, if you need to reliably store structured data consistently, it's likely that a relational database is your best bet, both from a performance and a functionality standpoint. Moreover, through the SQL language, databases offer an expressive way to retrieve, combine, and manipulate data, which is critical for many use cases.
Of course, there are downsides too. A database needs a rigid structure to be defined upfront for tables, relations, and constraints (that's pretty much essential and inherent to the technology). Later changes are of course possible, but they can have a lot of side effects (typically in terms of performance and potential constraint violations), and for this reason, they are impactful and expensive. On the other hand, we will see that alternative technologies (such as NoSQL storage) can implement changes in the data structure more easily.
For this reason, a relational database may not be suitable in cases where we don't exactly know the shape of the data objects we are going to store. Another potential issue is that, given the complexity and rigidity of the technology, you may end up with performance and functional issues, which are not always easy to troubleshoot.
A typical example relates to complex queries. A relational database typically uses indexes to achieve better performance (each specific implementation may use different techniques, but the core concepts are often the same). Indexes must be maintained over time, with operations such as defragmentation (and other similar ones, depending on the specific database implementation). If we fail to perform such maintenance properly, performance may be heavily impacted. And even if our indexes are working correctly, complex queries may still perform poorly.
This is because, in most practical implementations, you will need to combine and filter data from many different tables (an operation generally known as a join). These operations may be interpreted in many different ways by databases that will try to optimize the query times but will not guarantee good results in every case (especially when many tables and rows are involved).
Moreover, when doing complex queries, you may end up not correctly using the indexes, and small changes in a working query may put you in the same situation. For this reason, my suggestion is, in complex application environments, to make sure to always double-check your queries in advance with the database administrators, who are likely to have tools and experience for identifying potential issues before they slip into production environments.
As we have seen in this section, relational databases, while not being the most modern option, are still a very widespread and useful technology for storing data, especially when you have requirements regarding data integrity and structure. However, this comes at the cost of defining the data structure upfront and exercising some discipline in the maintenance and usage of the database.
You should also consider that, sometimes, relational databases may simply be overkill for simple use cases, where you just need simple queries and maybe not even persistence. We are going to discuss this scenario in the next section.
There are scenarios in which you simply need temporary storage and will access it in a simple way, such as by a known unique key associated with your object. This scenario is a perfect fit for key/value stores. Under this umbrella, you can find a lot of different implementations, which usually share some common features. The most basic one is the access model – almost every key/value store provides APIs for retrieving data by key. This is essentially the same mechanism as hash tables in Java, which guarantees very fast access. Data stored in this way can be serialized in many different formats. The most basic choice, for simple values, is strings, but Protobuf is another common choice (see Chapter 8, Designing Application Integration and Business Automation, where we discussed this and other serialization technologies).
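The access model can be sketched in a few lines (an in-process map with invented names standing in for a real product such as a remote key/value server; values are plain strings, the most basic serialization choice):

```java
import java.util.concurrent.ConcurrentHashMap;

// A minimal in-memory key/value store exposing the typical primitives:
// get, put, and delete. Real products add networking, persistence,
// and clustering on top of essentially this access model.
public class KeyValueStore {
    private final ConcurrentHashMap<String, String> data = new ConcurrentHashMap<>();

    public void put(String key, String value) { data.put(key, value); }

    public String get(String key) { return data.get(key); }

    public void delete(String key) { data.remove(key); }
}
```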
A key/value store may not offer persistent storage options, as that is not the typical use case. Data is simply kept in memory to be optimized for performance. Modern implementations, however, compromise by serializing data on disk or in an external store (such as a relational database). This is commonly done asynchronously to reduce the impact on access and save times.
Whether the technology you are using is providing persistent storage or not, there are other features for enhancing the reliability of a system. The most common one is based on data replication. Basically, you will have more than one system (also called nodes) running in a clustered way (meaning that they are talking to each other). Such nodes may be running on the same machine or, better yet, in different locations (to increase the reliability even more).
Then, the technology running your key/value store may be configured to propagate each change (adding, removing, or modifying data) into a number of different nodes (optionally, all of them). In this way, in case of the failure of a node, your data will still be present in one or more other nodes. This replication can be done synchronously (reducing the possibility of data loss but increasing the latency of each write operation) or asynchronously (the other way around).
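As a rough sketch of the synchronous variant (in-process maps with invented names standing in for real nodes), every write is propagated to all nodes before the call returns, so a read can be served by any surviving node:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch of synchronous replication: a write returns only after all
// nodes have applied it, trading write latency for durability.
public class ReplicatedStore {
    private final List<Map<String, String>> nodes = new ArrayList<>();

    public ReplicatedStore(int nodeCount) {
        for (int i = 0; i < nodeCount; i++) {
            nodes.add(new HashMap<>());
        }
    }

    // Synchronous write: all nodes are updated before returning.
    public void put(String key, String value) {
        for (Map<String, String> node : nodes) {
            node.put(key, value);
        }
    }

    // After a node failure, the data is still available on the others.
    public String getFromNode(int nodeIndex, String key) {
        return nodes.get(nodeIndex).get(key);
    }
}
```

An asynchronous variant would return after updating one node and propagate the change in the background, lowering latency but opening a window for data loss, as described above.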
In the upcoming sections, we are going to see some common scenarios relating to caching data and the life cycle of records stored in the key/value store. Let's start looking at some techniques to implement data caching.
A typical use case for key/value stores is caching. You can use a cache as a centralized location to store disposable data that's quickly accessible from your applications. Such data is typically considered disposable because it can be retrieved in other ways (such as from a relational database) if the key/value store is unavailable or doesn't have the data.
So, in an average case (sometimes referred to as a cache hit), you will have better performance and will avoid going into other storage (such as relational databases), which may be slow, overloaded, or expensive to access. In a worst-case scenario (sometimes referred to as a cache miss), you will still have other ways to access your data.
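The hit/miss logic above is often called the cache-aside pattern. A minimal sketch (the loader function, with an invented name, stands in for the slower backing store, such as a relational database):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside sketch: look in the cache first (cache hit); on a miss,
// fall back to the slower backing store and populate the cache.
public class CacheAside {
    private final Map<String, String> cache = new HashMap<>();
    private final Function<String, String> backingStore;
    public int misses = 0; // exposed only to make the behavior observable

    public CacheAside(Function<String, String> backingStore) {
        this.backingStore = backingStore;
    }

    public String get(String key) {
        String cached = cache.get(key);
        if (cached != null) {
            return cached;                       // cache hit: fast path
        }
        misses++;
        String loaded = backingStore.apply(key); // cache miss: slow path
        cache.put(key, loaded);                  // populate for next time
        return loaded;
    }
}
```

After the first lookup of a given key, subsequent lookups are served entirely from the cache, which is where the performance benefit comes from.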
Some common scenarios are as follows:
This scenario can be managed by notifying the key/value store about any change occurring in the persistent storage. This can be done by the application writing data, or it can be done directly by the persistent storage (if it is a feature provided by the technology) using a pattern known as change data capture. The key/value store may then decide to update the changed data or simply delete it from the cached view (forcing a retrieve from the persistent store when your application will look again for the same key).
Another common topic when talking about key/value stores is the life cycle of the data.
Since key/value stores use memory heavily, with huge datasets you may want to avoid keeping everything in memory, especially if the access patterns are identifiable (for example, if you can foresee with reasonable accuracy which data will be accessed by your application). Common patterns for deciding what to keep in memory and what to delete are as follows:
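One widely used eviction pattern is least recently used (LRU): when capacity is exceeded, the entry that has gone unaccessed the longest is dropped. In Java, a compact sketch is possible by using LinkedHashMap in access-order mode:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// LRU eviction sketch: LinkedHashMap's access-order mode keeps entries
// ordered by last access, and removeEldestEntry evicts the stalest one
// whenever the configured capacity is exceeded.
public class LruCache<K, V> extends LinkedHashMap<K, V> {
    private final int capacity;

    public LruCache(int capacity) {
        super(16, 0.75f, true); // true = order entries by access, not insertion
        this.capacity = capacity;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > capacity; // evict when over capacity
    }
}
```

Production-grade caches implement the same idea (plus variants such as least frequently used or time-to-live expiration) with more sophisticated, concurrent data structures.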
Key/value stores lack a standard language, such as SQL. It's also for this reason that key/value stores are a big family, including many different products and libraries, often offering more features than just key/value management. In the next section, we are going to see a few of the most famous implementations of key/value stores.
As previously mentioned, it's not easy to build a list of key/value store technology implementations. As we will see in the next few sections, this way of operating a database is considered to be a subcategory of a bigger family of storage systems, called NoSQL databases, offering more options and alternatives than just key/value storage. However, for the purpose of this section, let's see a list of what is commonly used in terms of key/value stores:
Now that we have seen some widespread key/value stores, let's see when they are a good fit and when they are not.
The most important advantage of key/value stores is the performance. The access time can be incredibly fast, especially when used without any persistent storage (in-memory only). This makes them particularly suitable for low-latency applications. Another advantage is simplicity, both from an architectural and a usage point of view.
Architecturally speaking, if your use case doesn't require clustering and other complex settings, a key/value store can be as simple as a single application exposing an API to retrieve and store records. From a usage point of view, most use cases can be implemented with primitives as simple as get, put, and delete. However, some of these points can become limitations of key/value stores, especially when you have different requirements. If your application needs to be reliable (as in losing as little data as possible when there's a failure), you may end up with complex multi-node setups and persistence techniques. This may, in turn, mean that in some cases, you can have inconsistency in data that may need to be managed from an application point of view.
Another common issue is that, usually, data is not structured in key/value stores. This means that it is only possible to retrieve data searching by key (or at least, that's the most appropriate scenario). While some implementations allow it, it can be hard, performance-intensive, or in some cases impossible to retrieve data with complex queries on the object values, in contrast with what you can do with SQL in relational databases.
In this section, we have covered the basics of data caching and key/value stores. Such techniques are increasingly used in enterprise environments, both for their positive impact on performance and for their scalability, which fits well with cloud-native architectures. Topics such as caching techniques and the life cycle of objects are common considerations when adopting key/value stores.
Key/value stores are considered to be part of a broader family of storage technologies that are alternatives to relational databases, called NoSQL. In the next section, we will go into more detail about this technology.
NoSQL is an umbrella term comprising a number of very different data storage technologies. The term was coined mostly for marketing purposes, in order to distinguish them from relational databases; some NoSQL databases even support SQL-like query languages. NoSQL databases claim to outdo relational databases in terms of performance. However, this advantage comes at the price of some compromises, usually in terms of transactionality and reliability. To discuss these limitations, it is worth having an overview of the CAP theorem.
The CAP theorem was theorized by Eric Brewer in 1998 and formally proven valid in 2002 by Seth Gilbert and Nancy Lynch. It refers to a distributed data store, regardless of the underlying technology, so it's also applicable to relational databases when instantiated in a multi-server setup (so, running in two or more different processes, communicating through a network, for clustering and high-availability purposes). The theorem focuses on the concept of a network split, when the system becomes partitioned into two (or more) subsets that are unable to communicate with each other due to connectivity loss.
The CAP theorem describes three core characteristics of distributed data stores:
The CAP theorem states that, when a partition occurs, you can only preserve consistency or availability. While a mathematical explanation is available (and beyond the scope of this book), the underlying idea can be understood easily:
However, it's worth noting that this theorem, while being the basis for understanding the limits of distributed data stores, must be contextualized in each particular scenario. In many enterprise contexts, it is possible to make the event of a network split extremely unlikely (for example, by providing multiple network connections between each server).
Moreover, it's common to have mechanisms to elect a primary partition when there's a network split. This basically means that if you are able to define which part of the cluster is primary (typically, the one with the greater number of survival nodes, and this is why it's usually recommended to have an odd number of nodes), this partition can keep working as usual, while the remaining partition can shut down or switch to a degraded mode (such as read-only). So, basically, it's crucial to understand the basics of the CAP theorem, but it's also important to understand that there are a number of ways to work around the consequences.
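The primary-partition rule sketched above boils down to a simple majority check (a deliberate simplification; real systems use full consensus protocols, but the arithmetic below is why an odd number of nodes is recommended):

```java
// A partition may keep serving traffic only if it can reach a strict
// majority of the cluster's nodes. With an even total, a symmetric
// split leaves NO side with a majority, which is why odd-sized
// clusters are usually recommended.
public class Quorum {
    public static boolean hasMajority(int totalNodes, int reachableNodes) {
        return reachableNodes > totalNodes / 2;
    }

    public static void main(String[] args) {
        System.out.println(hasMajority(5, 3)); // true: primary partition
        System.out.println(hasMajority(5, 2)); // false: degraded/read-only
        System.out.println(hasMajority(4, 2)); // false: even split, no primary
    }
}
```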
This is exactly the reasoning behind NoSQL databases. These databases shift their point of view, stretching the CAP capabilities a bit. While traditional relational databases focus on consistency and availability, they are often hard to operate reliably in a heavily distributed fashion. Conversely, NoSQL databases can operate better in horizontally distributed architectures, favoring scalability, throughput, and performance at the expense of availability (as we saw, becoming read-only when there are network partitions) or consistency (not providing ACID transaction capabilities).
And this brings us to another common trait of NoSQL stores – eventual consistency.
Indeed, most NoSQL stores, while not providing full transactionality (compared to relational databases), can still offer some data integrity by using the pattern of eventual consistency. Digging into the details and impacts of this pattern would take a lot of time; for the sake of this section, it's sufficient to know that a system implementing eventual consistency may have periods of time in which data is not coherent (in particular, querying the same data on two different nodes can lead to two different results).
With that said, it's usually possible to tune a NoSQL store in order to preserve consistency and provide full transactionality as a traditional relational database does. But in my personal experience, the impacts in terms of reduced performance and availability are not a worthwhile compromise. In other words, if you are looking for transactionality and data consistency, it's usually better to rely on relational databases.
With that said, let's have an overview of the different NoSQL database categories.
As we discussed in the previous sections, NoSQL is an umbrella term. There are a number of different categories of NoSQL stores:
Of course, as you can imagine, there is a lot more to say about NoSQL databases. I hope the pointers I gave in this section will help you quickly understand the major features of NoSQL databases, and I hope one of the examples I've provided will be useful for your software architecture. In the next section, we are going to have a look at filesystem storage.
Filesystems are a bit of a borderline concept when it comes to data storage systems. To be clear, filesystem storage is barely structured: it does not provide the APIs, schemas, and advanced features of the other storage systems that we have seen so far. However, it is still a very relevant layer in many applications, and there are some newer storage infrastructures that provide advanced features, so I think it's worth having a quick overview of some core concepts.
Filesystem storage should not be an alien concept to most of us. It is a persistent storage system backed by specific hardware (spinning or solid-state disks). There are many different filesystems, which can be considered the protocol used to abstract the read and write operations from and to such specific hardware. Other than creating, updating, and deleting files, and the arrangement of these files into folders, filesystems can provide other advanced features, such as journaling (to reduce the risk of data corruption) and locking (in order to provide exclusive access to files).
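From Java, these basic operations are available through the java.nio.file and java.nio.channels APIs. A small sketch (assuming Java 11+ for Files.writeString/readString; the file name is arbitrary) showing a write, an exclusive file lock, and a read back:

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Basic filesystem operations: write a file, briefly hold an exclusive
// lock on it (as provided by the underlying filesystem), and read it back.
public class FileBasics {
    public static String writeAndRead(Path file, String content) throws IOException {
        Files.writeString(file, content);
        try (FileChannel channel = FileChannel.open(file, StandardOpenOption.WRITE);
             FileLock lock = channel.lock()) {
            // While the lock is held, other processes are denied access
            // to the locked region (semantics vary by OS and filesystem).
        }
        return Files.readString(file);
    }

    // Convenience wrapper using a temporary file, cleaned up afterward.
    public static String roundTrip(String content) throws IOException {
        Path tmp = Files.createTempFile("demo", ".txt");
        try {
            return writeAndRead(tmp, content);
        } finally {
            Files.delete(tmp);
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println(roundTrip("hello")); // hello
    }
}
```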
Some common filesystems are the New Technology File System (NTFS), used in Windows environments, and the Extended File System (ext), used in Linux environments. However, these filesystems are designed for working on a single machine. More interesting for our purposes are the filesystems that allow interaction between different systems. One widespread implementation is the networked filesystem, a family of filesystem protocols providing access to files and directories over a network. The most notable example here is NFS, a protocol that provides multi-server access to a shared filesystem. The File Transfer Protocol (FTP) and the SSH File Transfer Protocol (SFTP) are other famous examples; even though they are somewhat dated, they are still widely used.
A recent addition to the family of network storage systems is Amazon S3. While it's technically an object store, it offers a way to interact with Amazon's storage facilities using APIs in order to store and retrieve files. It started as a proprietary implementation providing storage services on AWS infrastructure over the internet; since then, the S3 API has become a de facto standard, and there are a lot of other implementations, both open source and commercial, aiming to provide S3-compatible storage on-premises and in the cloud.
It's hard to talk about the disadvantages of filesystems because they are an essential requirement in every application, and it will stay like this for a long time. However, it's important to contextualize and think logically about the pros and cons of filesystems to better understand where to use them.
Application interaction over shared filesystems is particularly convenient when it comes to exchanging large amounts of data. In banking systems (especially legacy ones), it's common to exchange large numbers of operations (such as payments) to be performed in batches, in the form of huge .csv files. The advantage is that the files can be safely chunked, signed, and efficiently transferred over a network.
On the other hand, filesystems don't usually offer native indexing and full-text search, so these capabilities must be implemented on top. Moreover, filesystems (especially networked filesystems) can perform badly, especially when it comes to concurrent access and the locking of files.
With this section, we have completed our overview of storage systems.
In the next section, we are going to see how, in modern architecture, it is common to use more than one storage solution to address different use cases with the most suitable technology.
In the final section of this chapter, we'll explore a concept that may seem obvious but is still worth mentioning. Modern architecture tends to use multiple data storage solutions, and I think this is a particularly interesting approach.
In the past, it was common to start by defining a persistence strategy (typically on a relational database or another legacy persistence system) and build the application's functionality around it. This is no longer the case. Cloud-native technologies, through microservices, developed the idea that each microservice should own its own data, and we can extend this concept: each microservice could choose its own persistent storage technology, the one best suited to the particular characteristics of that business domain and the related use cases. Some services may need to focus on performance, while others will have a strong need for transactionality and data consistency.
However, even if you are dealing with a less innovative architecture, it's still worthwhile evaluating different ideas around data persistence solutions. Here are some discussion points about it:
So, once again, choosing the right data storage technology can be crucial to have a performant and well-written application, and it's a common practice to rely on more than one technology to meet the different needs that different parts of our application will require.
In this chapter, we have seen an overview of different possibilities on the data layer, ranging from traditional SQL databases to more modern alternatives.
While most of us are already familiar with relational databases, we have had a useful examination of the pros and cons of using this technology. We then broadened our view with alternative, widespread storage technologies, such as key/value stores, NoSQL, and even filesystems.
Eventually, we looked at how the choice of a particular way of storing data may affect both the application design and the performance of our system. Indeed, in modern architecture, we may want to pick the right storage solution for each use case by choosing different solutions where needed.
In the next chapter, we are going to discuss some architectural cross-cutting concerns. Topics such as security, resilience, usability, and observability are crucial to successful application architecture and will be analyzed to see their impacts and best practices.