Chapter 62. Small Files in a Big Data World

Adi Polak

Whether your data pipelines handle real-time event-driven streams, near real-time data, or batch processing jobs, when you work with a massive amount of data made up of small files, you will face the small files nightmare.

What Are Small Files, and Why Are They a Problem?

A small file is significantly smaller than the storage block size. Yes, even with object stores such as Amazon S3 and Azure Blob, there is a minimum block size. A file significantly smaller than that block can result in wasted space on the disk, since storage is optimized by block size.

To understand why, let’s first explore how reading and writing work. For read and write operations, there is a dedicated API call. For write requests, the storage writes three components:

  • The data itself

  • Metadata with descriptive properties for indexing and data management

  • A globally unique identifier for identification in a distributed system

More objects stored means extra unique identifiers and extra I/O calls for creating, writing, and closing the metadata and data files.

To read stored data, we use an API call to retrieve a specific object. On the storage server side, the server looks up the object’s unique identifier and resolves the actual address and disk to read from. A hierarchy of unique identifiers helps us navigate exabyte-scale object storage. The more unique identifiers (objects) there are, the longer the search takes, which lowers throughput because of the search time and disk seeks required.

Furthermore, this translates into an overhead of milliseconds for each object store operation, since the server translates each API call into remote procedure calls (RPCs).

When the server reaches its limits or takes too much time to retrieve the data, your API call for reading or writing will receive an error code such as 503 (server busy) or 500 (operation timed out).

Dealing with those errors is excruciating with tools such as Amazon Athena, Azure Synapse, and Apache Spark, since they abstract the storage calls away from us.

Why Does It Happen?

Let’s look at three general cases resulting in small files.

First, during the ingestion procedure, event streams originating from Internet of Things (IoT) devices, servers, or applications are translated into kilobyte-scale JSON files. Writing them to object storage without joining/bundling and compressing multiple files together will result in many small files.
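
One common way to avoid this is to bundle events before (or shortly after) they land: read the many tiny JSON objects in bulk and rewrite them as a handful of larger, compressed files. The sketch below is a minimal illustration with PySpark; the bucket names, paths, and target file count are assumptions, not part of the original text.

```python
# A minimal compaction sketch. The paths and target file count are
# illustrative assumptions, not prescriptions from this chapter.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compact-small-json").getOrCreate()

# Read the many kilobyte-scale JSON event files under a landing prefix.
events = spark.read.json("s3a://my-bucket/landing/events/2021-01-01/")

# Rewrite them as a few larger, compressed columnar files.
(events
    .coalesce(8)  # target a small number of output files
    .write
    .option("compression", "snappy")
    .mode("overwrite")
    .parquet("s3a://my-bucket/compacted/events/2021-01-01/"))
```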

Second, small files can result from parallelized Apache Spark jobs. With either batch- or stream-processing jobs, a new file gets written per write task; more Spark writing tasks means more files. Having too many parallel tasks for the size of the data can result in many small files. Data skew can have a similar effect: if most of the data is routed to one or a few writers, many writers are left with only small chunks of data to process and write, and each of these chunks gets written to a small file.
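
You can observe the one-file-per-write-task behavior directly by checking a DataFrame’s partition count just before writing. The sketch below uses a toy DataFrame and local paths purely for illustration.

```python
# Sketch: each write task (partition) produces one output file, so the
# partition count before the write determines the file count.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("files-per-task").getOrCreate()

df = spark.range(1_000_000)          # toy DataFrame
print(df.rdd.getNumPartitions())     # each partition becomes one output file

df.write.mode("overwrite").parquet("/tmp/many_small_files")

# Shrinking the partition count before writing shrinks the file count.
df.coalesce(4).write.mode("overwrite").parquet("/tmp/fewer_larger_files")
```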

Third, Hive tables can become over-partitioned when data is collected into daily or hourly partitions. As a general approach, if a Hive partition is smaller than ~256 MB, consider reviewing the partition design and tweaking the Hive merge file configurations by using hive.merge.smallfiles.avgsize and hive.merge.size.per.task.
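
When the partitions are written by Hive itself, those merge settings can be applied at the session level before a partition is rewritten. The following is only a rough sketch using PyHive; the host, table, partition, and column names are hypothetical, and the thresholds should be tuned to your own block size and workload.

```python
# Hypothetical example of applying Hive's small-file merge settings in a
# session before compacting a partition (connection, table, and column
# names are placeholders).
from pyhive import hive

conn = hive.Connection(host="hive-server.example.com", port=10000,
                       database="default")
cur = conn.cursor()

# Merge small output files at the end of map-only and map-reduce jobs.
cur.execute("SET hive.merge.mapfiles=true")
cur.execute("SET hive.merge.mapredfiles=true")
# If the job's average output file size falls below this value...
cur.execute("SET hive.merge.smallfiles.avgsize=134217728")   # 128 MB
# ...Hive launches a merge step that targets files of roughly this size.
cur.execute("SET hive.merge.size.per.task=268435456")        # 256 MB

# Rewriting a partition under these settings compacts its small files.
cur.execute(
    "INSERT OVERWRITE TABLE events PARTITION (dt='2021-01-01') "
    "SELECT event_id, payload FROM events WHERE dt='2021-01-01'"
)
```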

Detect and Mitigate

To solve the problem of small files, first identify the root cause. Is it the ingestion procedure, or offline batch processing? Check your Hive partition file sizes, your Spark job writers in the History Server UI, and the actual sizes of the files written at ingestion.
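
A quick way to confirm the diagnosis on object storage is to list object sizes under the prefixes your jobs write to and count how many fall below your target size. The sketch below uses boto3 against a hypothetical bucket and prefix; the 128 MB threshold is only an example.

```python
# Sketch: count objects under a prefix that fall below a size threshold
# (bucket, prefix, and threshold are illustrative assumptions).
import boto3

BUCKET = "my-bucket"
PREFIX = "warehouse/events/"
SMALL_FILE_THRESHOLD = 128 * 1024 * 1024  # 128 MB; tune to your block size

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

small, total = 0, 0
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get("Contents", []):
        total += 1
        if obj["Size"] < SMALL_FILE_THRESHOLD:
            small += 1

print(f"{small} of {total} objects are smaller than the threshold")
```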

If optimizing the ingestion procedure to generate bigger files won’t solve the problem, look at Spark’s repartition functionality (repartitionByRange) and coalesce functionality to combine small files. Spark 3.0 adds partitioning hints (REPARTITION, REPARTITION_BY_RANGE, and COALESCE) that suggest a specific strategy to the Spark SQL engine, as in the sketch below.
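
Put together, a compaction pass over an existing dataset might look like the following sketch. The paths, column name, and partition counts are illustrative; the SQL hint shows the Spark 3.0 syntax.

```python
# Sketch: combining small files with the DataFrame API and with a Spark 3.0
# partitioning hint (paths, column, and counts are illustrative).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("compaction").getOrCreate()

df = spark.read.parquet("s3a://my-bucket/warehouse/events/")

# DataFrame API: range-partition by a key into a modest number of partitions
# (or use coalesce to reduce the file count without a full shuffle).
(df.repartitionByRange(16, "event_date")
   .write.mode("overwrite")
   .parquet("s3a://my-bucket/warehouse/events_compacted/"))

# Spark SQL: a partitioning hint suggests the same strategy to the engine.
df.createOrReplaceTempView("events")
compacted = spark.sql(
    "SELECT /*+ REPARTITION_BY_RANGE(16, event_date) */ * FROM events"
)
compacted.write.mode("overwrite").parquet(
    "s3a://my-bucket/warehouse/events_compacted_sql/"
)
```

Note the trade-off: coalesce avoids a shuffle but can only reduce the partition count, while repartitionByRange shuffles the data and can also spread skewed keys more evenly across the writers.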

Conclusion

Be aware of small files when designing data pipelines. Try to avoid them, but know you can fix them too!
