S3 performance recommendations

The S3 service is a highly scalable environment, but there are some guidelines to follow to achieve maximum performance from the S3 backend. Your data in S3 is distributed according to the key, or key name, which is the name by which the object is identified in the bucket. The key name determines the partition the data is stored on within S3. The key can be just the filename, or it can include a prefix. As objects in S3 are grouped and stored in the backend according to their keys, we can expect to achieve at least 3,500 PUT/COPY/POST/DELETE and 5,500 GET/HEAD requests per second per prefix in a bucket. So, if we want more performance from S3, we need to address multiple partitions at the same time by distributing the keys across partitions. In this way, we can scale the request rate to S3 virtually without limit.
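As a back-of-the-envelope illustration of this scaling (the per-prefix figures below simply restate the baselines mentioned above, and the number of prefixes is an arbitrary example), the aggregate request capacity grows linearly with the number of prefixes your keys are spread across:

# Illustrative arithmetic only: each prefix gets its own request-rate budget,
# so aggregate capacity scales with the number of prefixes the keys span.
PUTS_PER_PREFIX = 3500   # PUT/COPY/POST/DELETE requests per second per prefix
GETS_PER_PREFIX = 5500   # GET/HEAD requests per second per prefix

num_prefixes = 10        # e.g. keys spread over ten different prefixes
print(f"Aggregate PUT capacity: {num_prefixes * PUTS_PER_PREFIX} requests/s")
print(f"Aggregate GET capacity: {num_prefixes * GETS_PER_PREFIX} requests/s")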

So, let's imagine we have a service that stores images from photographers; they take raw images and drop them into an S3 bucket. They take hundreds or thousands of photos per day, and the images all have the following name format: IMG_yyyymmdd_hhmmss.jpeg. Here are a few examples:

Sam copies the following images:
IMG_20181110_151628.jpeg
IMG_20181110_151745.jpeg
IMG_20181110_151823.jpeg

Peter copies the following images:
IMG_20181110_180506.jpeg
IMG_20181110_180904.jpeg
IMG_20181110_190712.jpeg

All of these images will be stored very close together, as their keys are very similar, and all of the photographers will be sharing the 3,500 PUT requests per second that can be issued to one prefix. To give each photographer the maximum performance, you would want to distribute these image names by adding prefixes to the keys. You can add a prefix that groups the keys together by creating a folder in the S3 management console and designating a separate directory for each photographer, which will distribute the images across multiple key prefixes. This also helps prevent images with the same name uploaded by two photographers from being overwritten or versioned. It also increases security, as we can grant each photographer access based solely on their key prefix.

So Sam would now be working under the sam/ prefix and Peter under peter/, and the uploaded images would appear as follows:

sam/IMG_20181110_151628.jpeg
sam/IMG_20181110_151745.jpeg
sam/IMG_20181110_151823.jpeg
peter/IMG_20181110_180506.jpeg
peter/IMG_20181110_180904.jpeg
peter/IMG_20181110_190712.jpeg
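To make the per-photographer prefixes concrete, here is a minimal sketch of how such uploads could be performed with boto3; the bucket name photo-uploads and the local file names are hypothetical, and uploading through the console or any other SDK affects the key prefix in exactly the same way:

import boto3

s3 = boto3.client("s3")

# Hypothetical upload bucket and local files; each photographer's key is
# prefixed with their name, so their uploads land under a separate prefix.
UPLOAD_BUCKET = "photo-uploads"
uploads = {
    "sam": ["IMG_20181110_151628.jpeg", "IMG_20181110_151745.jpeg"],
    "peter": ["IMG_20181110_180506.jpeg", "IMG_20181110_180904.jpeg"],
}

for photographer, images in uploads.items():
    for image in images:
        s3.upload_file(image, UPLOAD_BUCKET, f"{photographer}/{image}")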

By adding a prefix per photographer, we have doubled the performance of the environment. But now suppose we want a parallel processing environment that accesses the photographs from a distributed set of EC2 spot instances, and we want each instance to get maximum performance when reading these images. In this case, we would add a random prefix on top of the existing prefix to distribute the files even further.

Let's say that once a day we pick up the images and process them. Before processing, we can simply copy the images from the upload bucket, where they are prefixed by username, to a processing bucket, where the images are given a random prefix. After each image is copied, the original can be deleted from the upload bucket to save space.

In this example, we will add an eight-character random prefix to each image key. Once copied to the processing bucket, the image list will look like this:

2kf39s5f/sam/IMG_20181110_151628.jpeg
1ks9kdv8/sam/IMG_20181110_151745.jpeg
o9ues833/sam/IMG_20181110_151823.jpeg
kc8shj3d/peter/IMG_20181110_180506.jpeg
n8sk83ld/peter/IMG_20181110_180904.jpeg
u379di3r/peter/IMG_20181110_190712.jpeg

We have now distributed the images even further across key prefixes, while also increasing the GET request capacity available for processing the images.
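A minimal sketch of this daily copy step is shown below, using boto3; the bucket names photo-uploads and photo-processing are hypothetical, and the eight-character prefix is generated with the standard library rather than any particular scheme:

import secrets
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket names for the upload and processing stages.
UPLOAD_BUCKET = "photo-uploads"
PROCESSING_BUCKET = "photo-processing"

# List every uploaded object, copy it under an eight-character random prefix,
# then delete the original to save space in the upload bucket.
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=UPLOAD_BUCKET):
    for obj in page.get("Contents", []):
        key = obj["Key"]                      # e.g. sam/IMG_20181110_151628.jpeg
        random_prefix = secrets.token_hex(4)  # 4 bytes -> 8 hex characters
        s3.copy_object(
            Bucket=PROCESSING_BUCKET,
            Key=f"{random_prefix}/{key}",
            CopySource={"Bucket": UPLOAD_BUCKET, "Key": key},
        )
        s3.delete_object(Bucket=UPLOAD_BUCKET, Key=key)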

Whenever you upload large files to S3, you might find that you are not utilizing the full bandwidth of your uplink. This can happen when you have hit the bandwidth limit for a single upload. To mitigate this, use multipart uploads for your large files and send several parts at once; S3 can handle much more traffic when a single file is sent in multiple parts. Any file over 100 MB is a good candidate for a multipart upload. A multipart upload can have up to 10,000 parts, meaning you can easily distribute the traffic to S3 with up to 10,000 concurrent PUT requests. The per-second PUT limits per key prefix still apply when using multipart uploads, so make sure you distribute the files wisely.
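As a sketch of how this could look with boto3, the TransferConfig class lets you set the multipart threshold, part size, and upload concurrency; the file name, bucket name, and the specific threshold and concurrency values below are illustrative assumptions:

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Switch to multipart uploads above 100 MB and send several parts in parallel.
config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # use multipart for files over 100 MB
    multipart_chunksize=100 * 1024 * 1024,  # size of each uploaded part
    max_concurrency=10,                     # number of parts sent in parallel
)

# Hypothetical large file and bucket; upload_file handles the part splitting.
s3.upload_file("raw_footage.mov", "photo-uploads", "sam/raw_footage.mov", Config=config)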
