Storing data

One of the most common mistakes made when setting up storage for a big data environment is using a single solution, frequently an RDBMS, to handle all of your data storage requirements. With a one-size-fits-all approach, you have a tool for every task, but that tool is optimized for none of them. A single solution is rarely the best fit for all of your needs; the best solution for your environment is often a combination of storage systems that balances latency against cost. An ideal storage solution uses the right tool for the right job. Choosing a data store depends on several factors:

  • How structured is your data? Does it adhere to a specific, well-formed schema, as is the case with Apache web logs, standardized data protocols, and contractual interfaces? Is it completely arbitrary binary data, as in the cases of images, audio, video, and PDF documents? Or, is it semi-structured, with a general shape but potentially high variability across records, as in the case of JSON or CSV? (The first sketch after this list contrasts a well-structured log line with semi-structured JSON.)
  • How quickly does new data need to be available for querying (its temperature)? Is it a real-time scenario, where decisions are made as new records stream in, such as campaign managers making adjustments based on conversion rates or a website making product recommendations based on user-behavior similarity? Is it a daily, weekly, or monthly batch scenario, such as model training, financial statement preparation, or product performance reporting? Or, is it somewhere in between, such as user engagement emails, where real-time action isn't required and a buffer of a few minutes or even a few hours between the user action and the touchpoint is acceptable?
  • The size of the data ingest: Is the data ingested record by record as it arrives, such as JSON payloads from REST APIs that measure just a few KB each? Is it a large batch of records arriving all at once, such as system integrations and third-party data feeds? Or, is it somewhere in between, such as micro-batches of clickstream data aggregated together for more efficient processing (see the micro-batching sketch after this list)?
  • Total volume of data and its growth rate: Are you in the realm of GBs and TBs, or do you intend to store PBs or even exabytes (EB)? How much of this data is required for your specific analytics use cases? Do the majority of your queries only require a specific rolling window of time? Or, do you need a mechanism to query the entirety of your historical dataset?
  • What the cost will be to store and query the data in any particular location: In any computing environment, we generally see a triangle of constraints between performance, resiliency, and cost. The better the performance and the higher the resilience you want your storage to have, the more expensive it will tend to be. You may wish to run fast queries over petabytes of data, which is the kind of environment that Redshift can provide, but decide to settle on querying TBs of data compressed in the Parquet format using Athena to meet your cost requirements (see the Parquet sketch after this list).
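To make the structure question concrete, here is a minimal Python sketch (the log line and JSON payloads are made up for illustration) contrasting a well-structured Apache Common Log Format record, where every line parses against the same fixed pattern, with semi-structured JSON events whose fields vary from record to record:

```python
import json
import re

# Apache Common Log Format: every line follows the same fixed schema,
# so a single pattern parses the entire dataset.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<timestamp>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\d+|-)'
)

log_line = '192.0.2.10 - - [10/Oct/2023:13:55:36 +0000] "GET /index.html HTTP/1.1" 200 2326'
print(LOG_PATTERN.match(log_line).groupdict())

# Semi-structured JSON: the general shape is similar, but the set of
# fields can vary from one record to the next.
events = [
    '{"user": "alice", "action": "view", "product_id": 42}',
    '{"user": "bob", "action": "search", "query": "red shoes", "results": 17}',
]
for raw in events:
    record = json.loads(raw)
    # Consumers must tolerate absent keys instead of assuming a fixed schema.
    print(record["user"], record["action"], record.get("product_id", "n/a"))
```

The log line yields the same columns every time, which suits a relational store, while the JSON consumer must handle missing keys, which favors a schema-flexible store.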
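The in-between ingest pattern can also be sketched in a few lines. This is a minimal illustration rather than production code: deliver() is a hypothetical stand-in for the real sink (object storage, a stream, or a warehouse load), and the buffer is flushed when it is either full enough or old enough:

```python
import time

BATCH_SIZE = 100      # flush after this many records...
BATCH_MAX_AGE = 5.0   # ...or after this many seconds, whichever comes first

buffer = []
batch_started = time.monotonic()

def deliver(batch):
    # Hypothetical sink; a real system would write to object storage,
    # a stream, or a warehouse load here.
    print(f"delivering a batch of {len(batch)} records")

def ingest(record):
    global batch_started
    buffer.append(record)
    full = len(buffer) >= BATCH_SIZE
    stale = time.monotonic() - batch_started >= BATCH_MAX_AGE
    if full or stale:
        deliver(list(buffer))
        buffer.clear()
        batch_started = time.monotonic()

# Simulate a stream of clickstream events; a real service would also
# flush whatever remains in the buffer on shutdown.
for i in range(250):
    ingest({"event_id": i, "action": "click"})
```

The trade-off is explicit: a few seconds of added latency in exchange for far fewer, larger writes downstream.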
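To illustrate the cost lever, the following sketch (assuming the pyarrow library is installed; the file names and sample data are made up) writes the same records as plain CSV and as Snappy-compressed Parquet. Engines such as Athena charge by bytes scanned, so a compressed columnar format directly lowers query cost:

```python
import csv
import os

import pyarrow as pa
import pyarrow.parquet as pq

rows = [
    {"order_id": i, "region": "us-east-1", "amount": round(i * 1.5, 2)}
    for i in range(100_000)
]

# Row-oriented, uncompressed CSV.
with open("orders.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["order_id", "region", "amount"])
    writer.writeheader()
    writer.writerows(rows)

# Column-oriented Parquet with per-column Snappy compression.
pq.write_table(pa.Table.from_pylist(rows), "orders.parquet", compression="snappy")

print("csv bytes:    ", os.path.getsize("orders.csv"))
print("parquet bytes:", os.path.getsize("orders.parquet"))
```

On repetitive data like this, the Parquet file is a small fraction of the CSV's size, and a columnar engine can additionally skip the columns a query doesn't touch.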

Finally, what type of analytic queries will run against the data? Will it be powering a dashboard with a fixed set of metrics and drill-downs? Will it participate in large numerical aggregations rolled up by various business dimensions? Or, will it be used for diagnostics, leveraging string tokenization for full-text searching and pattern analysis? The following diagram combines multiple factors related to your data and the storage choice associated with it:

Understanding data storage

Once you have determined all of these characteristics of your data and understand its structure, you can assess which solution to use for your data storage. Let's learn about the various solutions for storing data.
