Using the right storage for the right need

For decades, organizations have used a traditional relational database and tried to fit everything into it, whether key/value-based user session data, unstructured log data, or analytics data for a data warehouse. However, a relational database is meant for transactional data and doesn't work well for other data types. It's like using a Swiss Army knife, which has multiple tools that each work only to a limited capacity; if you want to build a house, its screwdriver will not be up to the heavy lifting. Similarly, for specific data needs, you should choose the right tool that can do the heavy lifting and scale without compromising performance.

Solution architects need to consider multiple factors when matching data storage to the right technology. Here are the important ones:

  • Durability requirement: How should data be stored to prevent data corruption?
  • Data availability: How available must the data storage system be to deliver data?
  • Latency requirement: How fast should the data be available?
  • Data throughput: What are the data read and write needs?
  • Data size: How much data needs to be stored?
  • Data load: How many concurrent users need to be supported?
  • Data integrity: How will the accuracy and consistency of data be maintained?
  • Data queries: What will be the nature of queries?

In the following table, you can see different types of data with examples and appropriate storage types to use. Technology decisions need to be made based on storage type, as shown here:

| Data Type | Data Example | Storage Type | Storage Example |
|---|---|---|---|
| Transactional, structured schema | User order data, financial transaction | Relational database | Amazon RDS, Oracle, MySQL, PostgreSQL, MariaDB, Microsoft SQL Server |
| Key-value pair, semi-structured, unstructured | User session data, application log, review, comments | NoSQL | Amazon DynamoDB, MongoDB, Apache HBase, Apache Cassandra, Azure Tables |
| Analytics | Sales data, supply chain intelligence, business flow | Data warehouse | IBM Netezza, Amazon Redshift, Teradata, Greenplum, Google BigQuery |
| In-memory | User home page data, common dashboard | Cache | Redis cache, Amazon ElastiCache, Memcached |
| Object | Image, video | File-based | SAN, Amazon S3, Azure Blob Storage, Google Storage |
| Block | Installable software | Block-based | NAS, Amazon EBS, Amazon EFS, Azure Disk Storage |
| Streaming | IoT sensor data, clickstream data | Temporary storage for streaming data | Apache Kafka, Amazon Kinesis, Spark Streaming, Apache Flink |
| Archive | Any kind of data | Archive storage | Amazon Glacier, magnetic tape storage, virtual tape library storage |
| Web storage | Static web contents such as images, videos, HTML pages | CDN | Amazon CloudFront, Akamai CDN, Azure CDN, Google CDN, Cloudflare |
| Search | Product search, content search | Search index store and query | Amazon Elasticsearch Service, Apache Solr, Apache Lucene |
| Data catalog | Table metadata, data about data | Meta-data store | AWS Glue, Hive metastore, Informatica data catalog, Collibra data catalog |
| Monitoring | System log, network log, audit log | Monitor dashboard and alert | Splunk, Amazon CloudWatch, SumoLogic, Loggly |

As you can see in the preceding table, data has various properties, such as structured, semi-structured, unstructured, key-value pair, streaming, and so on. Choosing the right storage helps to improve not only the performance of the application, but also its scalability. For example, you can store user session data in a NoSQL database, which allows application servers to scale horizontally while still maintaining user sessions.
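The session example can be illustrated with a minimal sketch. It assumes a shared key-value store that sits outside the application servers; the `SessionStore` class here is a hypothetical in-memory stand-in for a real NoSQL service, and `AppServer`, `login`, `add_to_cart`, and the 30-minute expiry are illustrative names and values, not part of any product's API.

```python
import copy
import time
import uuid


class SessionStore:
    """Hypothetical in-memory stand-in for an external key-value (NoSQL) store."""

    def __init__(self, ttl_seconds=1800):
        self._items = {}          # session_id -> (expiry timestamp, session data)
        self._ttl = ttl_seconds   # assumed session lifetime

    def put(self, session_id, data, now=None):
        now = time.time() if now is None else now
        self._items[session_id] = (now + self._ttl, copy.deepcopy(data))

    def get(self, session_id, now=None):
        now = time.time() if now is None else now
        entry = self._items.get(session_id)
        if entry is None:
            return None
        expires_at, data = entry
        if now >= expires_at:     # expired sessions are dropped on read
            del self._items[session_id]
            return None
        return copy.deepcopy(data)


class AppServer:
    """Stateless application server: session state lives only in the shared store."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    def login(self, user):
        session_id = uuid.uuid4().hex
        self.store.put(session_id, {"user": user, "cart": []})
        return session_id

    def add_to_cart(self, session_id, item):
        session = self.store.get(session_id)
        if session is None:
            raise KeyError("session expired or unknown")
        session["cart"].append(item)
        self.store.put(session_id, session)
        return session["cart"]


# Two servers share one store, so either can serve the same user session.
shared_store = SessionStore()
server_a = AppServer("a", shared_store)
server_b = AppServer("b", shared_store)
sid = server_a.login("alice")
cart = server_b.add_to_cart(sid, "book")
```

Because neither server keeps session data in its own memory, a load balancer can route each request to any server, and servers can be added or removed without losing sessions.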

While choosing storage options, you need to consider the temperature of the data, which could be hot, warm, or cold:

  • For hot data, you are looking for sub-millisecond latency, which requires cache data storage. Some examples of hot data are stock trading and making product recommendations at runtime.
  • For warm data, such as financial statement preparation or product performance reporting, you can tolerate moderate latency, from seconds to minutes, and you should use a data warehouse or a relational database.
  • For cold data, such as storing 3 years of financial records for audit purposes, you can plan latency in hours, and store it in archive storage.
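The temperature tiers above can be sketched as a small lookup from tolerable latency to a storage type. The cut-off values are assumptions chosen for illustration, since the text gives only rough ranges (sub-millisecond, seconds to minutes, hours); the function name `storage_tier` is likewise hypothetical.

```python
# Assumed boundaries between tiers, in seconds. The text does not define
# exact cut-offs, so these are illustrative only.
HOT_MAX_SECONDS = 1.0        # anything that must be served in under a second
WARM_MAX_SECONDS = 3600.0    # anything that can wait up to an hour


def storage_tier(max_latency_seconds):
    """Return (temperature, suggested storage) for a tolerable read latency."""
    if max_latency_seconds <= 0:
        raise ValueError("latency must be positive")
    if max_latency_seconds < HOT_MAX_SECONDS:
        return ("hot", "cache")
    if max_latency_seconds < WARM_MAX_SECONDS:
        return ("warm", "data warehouse or relational database")
    return ("cold", "archive storage")


# Example: a trading lookup, a reporting query, and an audit retrieval.
for latency in (0.0005, 30, 4 * 3600):
    print(latency, storage_tier(latency))
```

In practice the boundaries would come from the application's performance SLA and the cost of each storage tier rather than from fixed constants.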

Choosing the appropriate storage for the data temperature also saves costs, in addition to meeting the performance SLA. Because any solution design revolves around handling data, a solution architect always needs to understand the data thoroughly and then choose the right technology.

In this section, we have covered a high-level view of data to convey the idea of using the proper storage for the nature of the data. You will learn more about data engineering in Chapter 13, Data Engineering and Machine Learning. Using the right tool for the right job helps to save costs and improve performance, so it's essential to choose the right data storage for the right need.
