Chapter 3
AWS Storage

THE AWS CERTIFIED SOLUTIONS ARCHITECT ASSOCIATE EXAM OBJECTIVES COVERED IN THIS CHAPTER MAY INCLUDE, BUT ARE NOT LIMITED TO, THE FOLLOWING:

  • Domain 1: Design Resilient Architectures
    • 1.1 Design a multi‐tier architecture solution
    • 1.2 Design highly available and/or fault‐tolerant architectures
    • 1.4 Choose appropriate resilient storage
  • Domain 2: Design High‐Performing Architectures
    • 2.1 Identify elastic and scalable compute solutions for a workload
    • 2.2 Select high‐performing and scalable storage solutions for a workload
  • Domain 3: Design Secure Applications and Architectures
    • 3.1 Design secure access to AWS resources
    • 3.2 Design secure application tiers
    • 3.3 Select appropriate data security options
  • Domain 4: Design Cost‐Optimized Architectures
    • 4.1 Identify cost‐effective storage solutions

Introduction

Amazon Simple Storage Service (S3) is where individuals, applications, and a long list of AWS services keep their data. It's an excellent platform for the following:

  • Maintaining backup archives, log files, and disaster recovery images
  • Running analytics on big data at rest
  • Hosting static websites

S3 provides inexpensive and reliable storage that can, if necessary, be closely integrated with operations running within or external to AWS.

This isn't the same as the operating system volumes you learned about in the previous chapter; those are kept on the block storage volumes driving your EC2 instances. S3, by contrast, provides a space for effectively unlimited object storage.

What's the difference between object and block storage? With block‐level storage, data on a raw physical storage device is divided into individual blocks whose use is managed by a file system. NTFS is a common filesystem used by Windows, and Linux might use Btrfs or ext4. The filesystem, on behalf of the installed OS, is responsible for allocating space for the files and data that are saved to the underlying device and for providing access whenever the OS needs to read some data.

An object storage system like S3, on the other hand, provides what you can think of as a flat surface on which to store your data. This simple design avoids some of the OS‐related complications of block storage and allows anyone easy access to any amount of professionally designed and maintained storage capacity.

When you write files to S3, they're stored along with up to 2 KB of metadata. The metadata is made up of key/value pairs that establish system details like data permissions and the appearance of a filesystem location (nested folders) within a bucket.

In this chapter, you're going to learn the following:

  • How S3 objects are saved, managed, and accessed
  • How to choose from among the various classes of storage to get the right balance of durability, availability, and cost
  • How to manage long‐term data storage lifecycles by incorporating Amazon Glacier into your design
  • What other AWS services exist to help you with your data storage and access operations

S3 Service Architecture

You organize your S3 files into buckets. By default, you're allowed to create as many as 100 buckets for each of your AWS accounts. As with other AWS services, you can ask AWS to raise that limit.

Although an S3 bucket and its contents exist only within a single AWS region, the name you choose for your bucket must be globally unique within the entire S3 system. There's some logic to this; you'll often want your data located in a particular geographical region to satisfy operational or regulatory needs. But at the same time, being able to reference a bucket without having to specify its region simplifies the process.

Here is the URL you would use to access a file called filename that's in a bucket called bucketname over HTTP:

s3.amazonaws.com/bucketname/filename

Naturally, this assumes you'll be able to satisfy the object's permissions requirements.

This is how that same file would be addressed using the AWS CLI:

s3://bucketname/filename
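
As a quick illustration, here's roughly how you might use that CLI address format to list a bucket's contents and download a single object. The bucketname and filename values are just placeholders standing in for your own names.

# list the objects stored in the bucket
aws s3 ls s3://bucketname/

# copy one object down to the local working directory
aws s3 cp s3://bucketname/filename .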

Prefixes and Delimiters

As you've seen, S3 stores objects within a bucket on a flat surface without subfolder hierarchies. However, you can use prefixes and delimiters to give your buckets the appearance of a more structured organization.

A prefix is a common text string that indicates an organization level. For example, the word contracts when followed by the delimiter / would tell S3 to treat a file with a name like contracts/acme.pdf as an object that should be grouped together with a second file named contracts/dynamic.pdf.

S3 recognizes folder/directory structures as they're uploaded and emulates their hierarchical design within the bucket, automatically converting slashes to delimiters. That's why you'll see the correct folders whenever you view your S3‐based objects through the console or the API.
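
To see prefixes and delimiters in action, here's a brief AWS CLI sketch (the bucket name and PDF files are hypothetical). Uploading two objects with a shared contracts/ prefix lets you later list just that "folder."

# upload two files using the same contracts/ prefix
aws s3 cp acme.pdf s3://bucketname/contracts/acme.pdf
aws s3 cp dynamic.pdf s3://bucketname/contracts/dynamic.pdf

# list only the objects that share the contracts/ prefix
aws s3 ls s3://bucketname/contracts/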

Working with Large Objects

Although there's no theoretical limit to the total amount of data you can store within a bucket, a single object may be no larger than 5 TB. Individual uploads can be no larger than 5 GB. To reduce the risk of data loss or aborted uploads, AWS recommends that you use a feature called Multipart Upload for any object larger than 100 MB.

As the name suggests, Multipart Upload breaks a large object into multiple smaller parts and transmits them individually to their S3 target. If one transmission should fail, it can be repeated without impacting the others.

Multipart Upload will be used automatically when the upload is initiated by the AWS CLI or a high‐level API, but you'll need to manually break up your object if you're working with a low‐level API.

An application programming interface (API) is a programmatic interface through which operations can be run from code or from the command line. AWS maintains APIs as the primary method of administration for each of its services. AWS provides low‐level APIs for cases when your S3 uploads require hands‐on customization, and it provides high‐level APIs for operations that can be more readily automated. This page contains specifics:

docs.aws.amazon.com/AmazonS3/latest/dev/uploadobjusingmpu.html
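
If you do end up driving a multipart upload by hand through the low‐level s3api commands, the basic flow looks something like the following sketch. The bucket, key, part file, and upload ID shown here are placeholders, and you'd normally split the source file yourself (with a tool like split) before uploading the pieces.

# start the upload; the response includes an UploadId you'll reuse below
aws s3api create-multipart-upload --bucket bucketname --key big-archive.zip

# upload each piece, noting the ETag returned for every part
aws s3api upload-part --bucket bucketname --key big-archive.zip \
   --part-number 1 --body big-archive-part-1 --upload-id EXAMPLEUPLOADID

# when all parts are up, combine them using a JSON list of part numbers and ETags
aws s3api complete-multipart-upload --bucket bucketname --key big-archive.zip \
   --upload-id EXAMPLEUPLOADID --multipart-upload file://parts.json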

If you need to transfer large files to an S3 bucket, the Amazon S3 Transfer Acceleration configuration can speed things up. When a bucket is configured to use Transfer Acceleration, uploads are routed through geographically nearby AWS edge locations and, from there, routed using Amazon's internal network.

You can find out whether Transfer Acceleration would actually improve transfer speeds between your location and a particular AWS region by using the Amazon S3 Transfer Acceleration Speed Comparison tool (s3-accelerate-speedtest.s3-accelerate.amazonaws.com/en/accelerate-speed-comparsion.html). If your transfers are, in fact, good candidates for Transfer Acceleration, you should enable the setting in your bucket. You can then use special endpoint domain names (like bucketname.s3-accelerate.amazonaws.com) for your transfers.
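
Here's a brief sketch of what enabling and using Transfer Acceleration from the AWS CLI might look like; the bucket name and archive file are placeholders.

# turn on Transfer Acceleration for the bucket
aws s3api put-bucket-accelerate-configuration \
   --bucket bucketname --accelerate-configuration Status=Enabled

# tell the CLI to use the accelerated endpoints for s3 commands
aws configure set default.s3.use_accelerate_endpoint true

# subsequent uploads are routed through the nearest edge location
aws s3 cp large-archive.zip s3://bucketname/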

Work through Exercise 3.1 and create your own bucket.

Encryption

Unless it's intended to be publicly available—perhaps as part of a website—data stored on S3 should always be encrypted. You can use encryption keys to protect your data while it's at rest within S3 and—by using only Amazon's encrypted API endpoints for data transfers—protect data during its journeys between S3 and other locations.

Data at rest can be protected using either server‐side or client‐side encryption.

Server‐Side Encryption

The “server‐side” here is the S3 platform, and it involves having AWS encrypt your data objects as they're saved to disk and decrypt them when you send properly authenticated requests for retrieval.

You can use one of three encryption options:

  • Server‐Side Encryption with Amazon S3‐Managed Keys (SSE‐S3), where AWS uses its own enterprise‐standard keys to manage every step of the encryption and decryption process
  • Server‐Side Encryption with AWS KMS‐Managed Keys (SSE‐KMS), where, beyond the SSE‐S3 features, the use of an envelope key is added along with a full audit trail for tracking key usage. You can optionally import your own keys through the AWS KMS service.
  • Server‐Side Encryption with Customer‐Provided Keys (SSE‐C), which lets you provide your own keys for S3 to apply to its encryption
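
As a simple illustration, here's roughly how you might request SSE‐S3 encryption for a single upload, or make SSE‐S3 the default for a whole bucket, using the AWS CLI. The bucket and file names are placeholders.

# encrypt one object with S3-managed keys (SSE-S3) as it's uploaded
aws s3 cp confidential.pdf s3://bucketname/ --sse AES256

# or make SSE-S3 the default for every new object written to the bucket
aws s3api put-bucket-encryption --bucket bucketname \
   --server-side-encryption-configuration \
   '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"AES256"}}]}'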

Client‐Side Encryption

It's also possible to encrypt data before it's transferred to S3. This can be done using an AWS KMS–Managed Customer Master Key (CMK), which produces a unique key for each object before it's uploaded. You can also use a Client‐Side Master Key, which you provide through the Amazon S3 encryption client.

Server‐side encryption can greatly reduce the complexity of the process and is often preferred. Nevertheless, in some cases, your company (or regulatory oversight body) might require that you maintain full control over your encryption keys, leaving client‐side as the only option.

Logging

Tracking S3 events to log files is disabled by default—S3 buckets can see a lot of activity, and not every use case justifies the log data that S3 can generate.

When you enable logging, you'll need to specify both a source bucket (the bucket whose activity you're tracking) and a target bucket (the bucket to which you'd like the logs saved). Optionally, you can also specify delimiters and prefixes (such as the creation date or time) to make it easier to identify and organize logs from multiple source buckets that are saved to a single target bucket.
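
Enabling server access logging from the AWS CLI might look something like this sketch, which assumes a source bucket named source-bucket and a separate target bucket named logging-bucket that already grants S3's log delivery service permission to write to it.

aws s3api put-bucket-logging --bucket source-bucket \
   --bucket-logging-status '{
     "LoggingEnabled": {
       "TargetBucket": "logging-bucket",
       "TargetPrefix": "logs/source-bucket/"
     }
   }'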

S3‐generated logs, which sometimes appear only after a short delay, will contain basic operation details, including the following:

  • The account and IP address of the requestor
  • The source bucket name
  • The action that was requested (GET, PUT, POST, DELETE, etc.)
  • The time the request was issued
  • The response status (including error code)

S3 buckets are also used by other AWS services—including CloudWatch and CloudTrail—to store their logs or other objects (like EBS Snapshots).

S3 Durability and Availability

S3 offers more than one class of storage for your objects. The class you choose will depend on how critical it is that the data survives no matter what (durability), how quickly you might need to retrieve it (availability), and how much money you have to spend.

Durability

S3 measures durability as a percentage. For instance, the 99.999999999 percent durability guarantee for most S3 classes and Amazon Glacier is as follows:

… corresponds to an average annual expected loss of 0.000000001% of objects. For example, if you store 10,000,000 objects with Amazon S3, you can on average expect to incur a loss of a single object once every 10,000 years.

Source: aws.amazon.com/s3/faqs

In other words, realistically, there's pretty much no way that you can possibly lose data stored on one of the standard S3/Glacier platforms because of infrastructure failure. However, it would be irresponsible to rely on your S3 buckets as the only copies of important data. After all, there's a real chance that a misconfiguration, account lockout, or unanticipated external attack could permanently block access to your data. And, as crazy as it might sound right now, it's not unthinkable to suggest that AWS could one day go out of business. Kodak and Blockbuster Video once dominated their industries, right? You should always back up your data to multiple locations, using different services and media types. You'll learn how to do that in Chapter 10, "The Reliability Pillar."

The high durability rates delivered by S3 are largely a result of the way it automatically replicates your data across at least three availability zones. This means that even if an entire AWS facility were suddenly wiped off the map, copies of your data would be restored from a different zone.

There is, however, one storage class that isn't quite so resilient. Amazon S3 Reduced Redundancy Storage (RRS) is rated at only 99.99 percent durability (because it's replicated across fewer servers than other classes). The RRS class is still available for historical reasons, but it's officially not recommended that you actually use it.

You can balance increased/decreased durability against other features like availability and cost to get the balance that's right for you. While all currently recommended classes are designed for 99.999999999 percent (11 nines) durability, most are also maintained in at least three availability zones. The exception is S3 One Zone‐IA, which, as the name suggests, stores its data in only a single zone. The difference shows up in its slightly lower availability, which we'll discuss next.

Availability

Object availability is also measured as a percentage; this time, though, it's the percentage you can expect a given object to be instantly available on request through the course of a full year. The Amazon S3 Standard class, for example, guarantees that your data will be ready whenever you need it (meaning it will be available) for 99.99% of the year. That works out to less than an hour of downtime each year. If you feel downtime has exceeded that limit within a single year, you can apply for a service credit. Amazon's durability guarantee, by contrast, is designed to provide 99.999999999% data protection. This means there's practically no chance your data will be lost, even if you might sometimes not have instant access to it.

S3 Intelligent‐Tiering is a relatively new storage class that can save you money by automatically optimizing where your data is stored. For a monthly automation fee, Intelligent‐Tiering will monitor the way you access data within the class over time. It will automatically move an object to the lower‐cost infrequent access tier after it hasn't been accessed for 30 consecutive days.

Table 3.1 illustrates the availability guarantees for all S3 classes.

TABLE 3.1 Guaranteed availability standards for S3 storage

Class                     Availability guarantee
S3 Standard               99.99%
S3 Standard‐IA            99.9%
S3 One Zone‐IA            99.5%
S3 Intelligent‐Tiering    99.9%

Eventually Consistent Data

It's important to bear in mind that S3 replicates data across multiple locations. As a result, there might be brief delays while updates to existing objects propagate across the system. Uploading a new version of a file or, alternatively, deleting an old file altogether can result in one site reflecting the new state with another still unaware of any changes.

To ensure that there's never a conflict between versions of a single object—which could lead to serious data and application corruption—you should treat your data according to an eventually consistent standard. That is, you should expect a delay (usually just two seconds or less) and design your operations accordingly.

Because there isn't the risk of corruption when creating new objects, S3 provides read‐after‐write consistency for creation (PUT) operations.

S3 Object Lifecycle

Many of the S3 workloads you'll launch will probably involve backup archives. But the thing about backup archives is that, when properly designed, they're usually followed regularly by more backup archives. Maintaining some previous archive versions is critical, but you'll also want to retire and delete older versions to keep a lid on your storage costs.

S3 lets you automate all this with its versioning and lifecycle features.

Versioning

Within many file system environments, saving a file using the same name and location as a preexisting file will overwrite the original object. That ensures you'll always have the most recent version available to you, but you will lose access to older versions—including versions that were overwritten by mistake.

By default, objects on S3 work the same way. But if you enable versioning at the bucket level, then older overwritten copies of an object will be saved and remain accessible indefinitely. This solves the problem of accidentally losing old data, but it replaces it with the potential for archive bloat. Here's where lifecycle management can help.
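
Enabling versioning is a one‐line operation. Here's a sketch using the AWS CLI (bucketname is a placeholder); the second command shows how you'd confirm that older versions are being retained.

# turn on versioning for the bucket
aws s3api put-bucket-versioning --bucket bucketname \
   --versioning-configuration Status=Enabled

# list every version of every object, including overwritten copies
aws s3api list-object-versions --bucket bucketname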

Lifecycle Management

In addition to the S3 Intelligent‐Tiering storage class we discussed earlier, you can manually configure lifecycle rules for a bucket that will automatically transition an object's storage class after a set number of days. You might, for instance, have new objects remain in the S3 Standard class for their first 30 days, after which they're moved to the cheaper S3 One Zone‐IA for another 30 days. If regulatory compliance requires that you maintain older versions, your files could then be moved to the low‐cost, long‐term storage service Glacier for 365 more days before being permanently deleted.

Try it yourself with Exercise 3.2.

Accessing S3 Objects

If you didn't think you'd ever need your data, you wouldn't go to the trouble of saving it to S3. So, you'll need to understand how to access your S3‐hosted objects and, just as important, how to restrict access to only those requests that match your business and security needs.

Access Control

Out of the box, new S3 buckets and objects will be fully accessible to your account but to no other AWS accounts or external visitors. You can strategically open up access at the bucket and object levels using access control list (ACL) rules, finer‐grained S3 bucket policies, or Identity and Access Management (IAM) policies.

There is more than a little overlap between those three approaches. In fact, ACLs are really leftovers from before AWS created IAM. As a rule, Amazon recommends applying S3 bucket policies or IAM policies instead of ACLs.

S3 bucket policies—which are formatted as JavaScript Object Notation (JSON) text and attached to your S3 bucket—will make sense for cases where you want to control access to a single S3 bucket for multiple external accounts and users. On the other hand, IAM policies—because they exist at the account level within IAM—will probably make sense when you're trying to control the way individual users and roles access multiple resources, including S3.

The following code is an example of an S3 bucket policy that allows both the root user and the user Steve from the specified AWS account to access the S3 MyBucket bucket and its contents. Both users are considered principals within this rule.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "AWS": ["arn:aws:iam::xxxxxxxxxxxx:root",
        "arn:aws:iam::xxxxxxxxxxxx:user/Steve"]
      },
      "Action": "s3:*",
      "Resource": ["arn:aws:s3:::MyBucket",
                   "arn:aws:s3:::MyBucket/*"]
    }
  ]
}

When it's attached to an IAM entity (a user, group, or role), the following IAM policy will accomplish the same thing as the previous S3 bucket policy:

{
  "Version": "2012-10-17",
  "Statement":[{
    "Effect": "Allow",
    "Action": "s3:*",
    "Resource": ["arn:aws:s3:::MyBucket",
                 "arn:aws:s3:::MyBucket/*"]
    }
  ]
}

IAM roles and policies will be discussed in greater detail in Chapter 6, “Authentication and Authorization—AWS Identity and Access Management.”

You can also closely control the way users and services access objects within your buckets by using Amazon S3 Access Points. An access point is a hostname that can point to a carefully defined subset of objects in a bucket. Depending on how you configure your access points, clients invoking the hostname will be able to read or write only the data you allow, and only as long as you allow it.

A simple AWS CLI command to request an access point might look something like this:

aws s3control create-access-point --name my-vpc-ap \
   --account-id 123456789012 --bucket my-bucket \
   --vpc-configuration VpcId=vpc-2b9d3c

Presigned URLs

If you want to provide temporary access to an object that's otherwise private, you can generate a presigned URL. The URL will be usable for a specified period of time, after which it will become invalid. You can build presigned URL generation into your code to provide object access programmatically.

The following AWS CLI command will return a URL that includes the required authentication string. The authentication will become invalid after 10 minutes (600 seconds). The default expiration value is 3,600 seconds (one hour).

aws s3 presign s3://MyBucketName/PrivateObject --expires-in 600

Try it yourself with Exercise 3.3.

Static Website Hosting

S3 buckets can be used to host the HTML files for entire static websites. A website is static when the system services used to render web pages and scripts are all client‐ rather than server‐based. This architecture permits simple and lean HTML code that's designed to be executed by the client browser.

S3, because it's such an inexpensive yet reliable platform, is an excellent hosting environment for such sites. When an S3 bucket is configured for static hosting, traffic directed at the bucket's URL can be automatically made to load a specified root document, usually named index.html. Users can click links within HTML pages to be sent to the target page or media resource. Error handling and redirects can also be incorporated into the profile.
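
Assuming your bucket's content is publicly readable (which typically means relaxing the bucket's public access settings and attaching an appropriate bucket policy), configuring static hosting from the AWS CLI can be as simple as this sketch; the bucket and document names are placeholders.

aws s3 website s3://bucketname/ \
   --index-document index.html --error-document error.html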

If you want requests for a DNS domain name (like mysite.com) routed to your static site, you can use Amazon Route 53 to associate your bucket's endpoint with any registered name. This will work only if your domain name is also the name of the S3 bucket. You'll learn more about domain name records in Chapter 8, “The Domain Name System and Network Routing: Amazon Route 53 and Amazon CloudFront.”

You can also get a free SSL/TLS certificate to encrypt your site by requesting a certificate from AWS Certificate Manager (ACM) and associating it with a CloudFront distribution that specifies your S3 bucket as its origin.

Build your own static website using Exercise 3.4.

Amazon S3 Select and Glacier Select

AWS provides a different way to access data stored on either S3 or Glacier: Select. The feature lets you apply SQL‐like queries to stored objects so that only relevant data from within objects is retrieved, permitting significantly more efficient and cost‐effective operations.

One possible use case would involve large comma‐separated values (CSV) files containing sales and inventory data from multiple retail sites. Your company's marketing team might need to periodically analyze only sales data and only from certain stores. Using S3 Select, they'll be able to retrieve exactly the data they need—just a fraction of the full data set—while bypassing the bandwidth and cost overhead associated with downloading the whole thing.
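
Here's a rough sketch of what such a query might look like through the AWS CLI's select-object-content command. The bucket, object key, and store_id column are all hypothetical; they simply stand in for whatever your real CSV data contains.

aws s3api select-object-content \
   --bucket bucketname --key sales/q1-sales.csv \
   --expression "SELECT s.* FROM s3object s WHERE s.store_id = '42'" \
   --expression-type SQL \
   --input-serialization '{"CSV": {"FileHeaderInfo": "USE"}}' \
   --output-serialization '{"CSV": {}}' \
   store42-q1.csv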

Amazon S3 Glacier

At first glance, Glacier looks a bit like just another S3 storage class. After all, like most S3 classes, Glacier guarantees 99.999999999 percent durability and, as you've seen, can be incorporated into S3 lifecycle configurations.

Nevertheless, there are important differences. Glacier, for example, supports archives as large as 40 TB, whereas individual S3 objects can be no larger than 5 TB. Its archives are encrypted by default, whereas encryption on S3 is an option you need to select. And unlike S3's "human‐readable" key names, Glacier archives are given machine‐generated IDs.

But the biggest difference is the time it takes to retrieve your data. Retrieving the objects from an existing Glacier archive can take a number of hours, compared to nearly instant access from S3. That last feature really defines the purpose of Glacier: to provide inexpensive long‐term storage for data that will be needed only in unusual and infrequent circumstances.

In the context of Glacier, the term archive is used to describe an object like a document, video, or a TAR or ZIP file. Archives are stored in vaults—the Glacier equivalent of S3's buckets. Glacier vault names do not have to be globally unique.
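
Creating a vault and uploading an archive to it directly (rather than through an S3 lifecycle rule) might look like this AWS CLI sketch. The vault name and archive file are placeholders, and the hyphen after --account-id simply means "use the current account."

# create a vault in the current region
aws glacier create-vault --account-id - --vault-name sales-archives

# upload an archive; the response includes the machine-generated archive ID
aws glacier upload-archive --account-id - --vault-name sales-archives \
   --body weekly-backup.tar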

There are currently two Glacier storage tiers: Standard and Deep Archive. As you can probably guess, Glacier Deep Archive will cost you less but will require longer waits for data retrieval. Storing one terabyte of data for a month on Glacier standard will, at current rates, cost you $4.10, while leaving the same terabyte for the same month on Glacier Deep Archive will cost only $1.02. Retrieval from Deep Archive, however, will take between 12 and 48 hours.

Table 3.2 lets you compare the costs of retrieving 100 GB of data from Glacier using its various retrieval tiers.

TABLE 3.2 Sample retrieval costs for Glacier data in the US East region

Tier                     Amount retrieved   Cost
Glacier Standard         100 GB             $0.90
Glacier Expedited        100 GB             $3.00
Glacier Bulk             100 GB             $0.25
Deep Archive Standard    100 GB             $2.00

Storage Pricing

To give you a sense of what S3 and Glacier might cost you, here's a typical usage scenario. Imagine you make weekly backups of your company sales data that generate 5 GB archives. You decide to maintain each archive in the S3 Standard class for its first 30 days and then transition it to S3 One Zone‐IA, where it will remain for 90 more days. At the end of those 120 days, you will move your archives once again, this time to Glacier, where they will be kept for another 730 days (two years) and then deleted.

Once your archive rotation is in full swing, you'll have a steady total of (approximately) 20 GB in S3 Standard, 65 GB in One Zone‐IA, and 520 GB in Glacier. Table 3.3 shows what that storage will cost in the US East region at rates current as of this writing.

TABLE 3.3 Sample storage costs for data in the US East region

Class          Storage amount   Rate/GB/month   Cost/month
Standard       20 GB            $0.023          $0.46
One Zone‐IA    65 GB            $0.01           $0.65
Glacier        520 GB           $0.004          $2.08
Total                                           $3.19

Of course, storage is only one part of the mix. You'll also be charged for operations including data retrievals; PUT, COPY, POST, or LIST requests; and lifecycle transition requests. Full, up‐to‐date details are available at aws.amazon.com/s3/pricing.

Exercise 3.5 will introduce you to an important cost‐estimating tool.

Other Storage‐Related Services

It's worth being aware of some other storage‐related AWS services that, while perhaps not as common as the others you've seen, can make a big difference for the right deployment.

Amazon Elastic File System

The Elastic File System (EFS) provides automatically scalable and shareable file storage to be accessed from Linux instances. EFS‐based files are designed to be accessed from within a virtual private cloud (VPC) via Network File System (NFS) mounts on EC2 Linux instances or from your on‐premises servers through AWS Direct Connect connections. The goal is to make it easy to enable secure, low‐latency, and durable file sharing among multiple instances.
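
As a rough sketch, provisioning a filesystem and then mounting it from an EC2 Linux instance might look like the following. The creation token, filesystem ID, region, and mount point are all placeholders, and a mount target must already exist in the instance's subnet.

# create the filesystem; the response includes its FileSystemId
aws efs create-file-system --creation-token shared-docs-fs

# then, from a Linux instance in the same VPC, mount it over NFS
sudo mount -t nfs4 -o nfsvers=4.1 \
   fs-12345678.efs.us-east-1.amazonaws.com:/ /mnt/efs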

Amazon FSx

Amazon FSx comes in two flavors: FSx for Lustre and Amazon FSx for Windows File Server. Lustre is an open source distributed filesystem built to give Linux clusters access to high‐performance filesystems for use in compute‐intensive operations. Amazon's FSx service brings Lustre capabilities to your AWS infrastructure.

FSx for Windows File Server, as you can tell from the name, offers the kind of file‐sharing service EFS provides but for Windows servers rather than Linux. FSx for Windows File Server integrates operations with Server Message Block (SMB), NTFS, and Microsoft Active Directory.

AWS Storage Gateway

Integrating the backup and archiving needs of your local operations with cloud storage services can be complicated. AWS Storage Gateway provides software gateway appliances (based on an on‐premises hardware appliance or virtual machines built on VMware ESXi, Microsoft Hyper‐V, Linux KVM, VMware Cloud on AWS, or EC2 images) with multiple virtual connectivity interfaces. Local devices can connect to the appliance as though it's a physical backup device like a tape drive, and the data itself is saved to AWS platforms like S3 and EBS. Data can be maintained in a local cache to make it locally available.

AWS Snowball

Migrating large data sets to the cloud over a normal Internet connection can sometimes require far too much time and bandwidth to be practical. If you're looking to move terabyte‐ or even petabyte‐scaled data for backup or active use within AWS, ordering a Snowball device might be the best option.

When requested, AWS will ship you a physical Snowball storage device, protected by 256‐bit encryption, onto which you'll copy your data. You then ship the device back to Amazon, where its data will be uploaded to your S3 bucket(s).

Choosing the best method for transferring data to your AWS account will require a bit of arithmetic. You'll have to know the real‐world upload speeds you get from your Internet connection and how much of that bandwidth wouldn't be used by other operations. In this chapter, you've learned about Multipart Upload and Transfer Acceleration for moving larger objects into an S3 bucket. But some objects are just so large that uploading them over your existing Internet connection isn't practical. Think about it: if your Internet service provider (ISP) gives you 10 Mbps of upload bandwidth, then, assuming no one else is using the connection, uploading a one‐terabyte archive would take you around 10 days!

So, if you really need to move that data to the cloud, you're going to have to either invest in an expensive AWS Direct Connect connection or introduce yourself to an AWS Snowball device (or, for really massive volumes of data, AWS Snowmobile at aws.amazon.com/snowmobile).

AWS DataSync

DataSync specializes in moving on‐premises data stores into your AWS account with a minimum of fuss. It works over your regular Internet connection, so it's not as useful as Snowball for really large data sets. But it is much more flexible, since you're not limited to S3 (or RDS as you are with AWS Database Migration Service). Using DataSync, you can drop your data into any service within your AWS account. That means you can do the following:

  • Quickly and securely move old data out of your expensive data center into cheaper S3 or Glacier storage.
  • Transfer data sets directly into S3, EFS, or FSx, where it can be processed and analyzed by your EC2 instances.
  • Apply the power of any AWS service to any class of data as part of an easy‐to‐configure automated system.

DataSync can handle transfer rates of up to 10 Gbps (assuming your infrastructure has that capacity) and offers both encryption and data validation.

AWS CLI Example

This example will use the AWS CLI to create a new bucket and recursively copy the sales‐docs directory to it. Then, using the low‐level s3api CLI (which should have been installed along with the regular AWS CLI package), you'll check for the current lifecycle configuration of your new bucket with the get‐bucket‐lifecycle‐configuration subcommand, specifying your bucket name. This will return an error, of course, since there currently is no configuration.

Next, you'll run the put‐bucket‐lifecycle‐configuration subcommand, specifying the bucket name. You'll also add some JSON code to the --lifecycle-configuration argument. The code (which could also be passed as a file) will transition all objects using the sales‐docs prefix to the Standard‐IA class after 30 days and to Glacier after 60 days. The objects will be deleted (or "expire") after a full year (365 days).

Finally, you can run get‐bucket‐lifecycle‐configuration once again to confirm that your configuration is active. Here are the commands you would need to run to make all this work:

$ aws s3 mb s3://bucket-name
$ aws s3 cp --recursive sales-docs/ s3://bucket-name
$ aws s3api get-bucket-lifecycle-configuration \
   --bucket bucket-name
$ aws s3api put-bucket-lifecycle-configuration \
   --bucket bucket-name \
   --lifecycle-configuration '{
    "Rules": [
        {
            "Filter": {
                "Prefix": "sales-docs/"
            },
            "Status": "Enabled",
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 60,
                    "StorageClass": "GLACIER"
                }
            ],
            "Expiration": {
                "Days": 365
            },
            "ID": "Lifecycle for bucket objects."
        }
    ]
}'
$ aws s3api get-bucket-lifecycle-configuration \
   --bucket bucket-name

Summary

Amazon S3 provides reliable and highly available object‐level storage for low‐maintenance, high‐volume archive and data storage. Objects are stored in buckets on a “flat” surface. However, through the use of prefixes, objects can be made to appear as though they're part of a normal filesystem.

You can—and usually should—encrypt your S3 data using either AWS‐provided or self‐serve encryption keys. Encryption can take place when your data is at rest using either server‐side or client‐side encryption.

There are multiple storage classes within S3 relying on varying degrees of data replication that allow you to balance durability, availability, and cost. Lifecycle management lets you automate the transition of your data between classes until it's no longer needed and can be deleted.

You can control who and what get access to your S3 buckets—and when—through legacy ACLs or through more powerful S3 bucket policies and IAM policies. Presigned URLs are also a safe way to allow temporary and limited access to your data.

You can reduce the size and cost of your requests against S3 and Glacier‐based data by leveraging the SQL‐like Select feature. You can also provide inexpensive and simple static websites through S3 buckets.

Amazon Glacier stores your data archives in vaults that might require hours to retrieve but that cost considerably less than the S3 storage classes.

Exam Essentials

  • Understand the way S3 resources are organized.   S3 objects are stored in buckets whose names must be globally unique. Buckets are associated with AWS regions. Objects are stored within buckets on a “flat” surface, but prefixes and delimiters can give data the appearance of a folder hierarchy.
  • Understand how to optimize your data transfers.   Although the individual objects you store within an S3 bucket can be as large as 5 TB, anything larger than 100 MB should be uploaded with Multipart Upload, and objects larger than 5 GB must use Multipart Upload.
  • Understand how to secure your S3 data.   You can use server‐side encryption to protect data within S3 buckets using either AWS‐generated or your own privately generated keys. Data can be encrypted even before being transferred to S3 using client‐side encryption.
  • Understand how S3 object durability and availability are measured.   The various S3 classes (and Glacier) promise varying levels of infrastructure reliability, along with data availability.
  • Understand S3 object versioning and lifecycle management.   Older versions of S3 objects can be saved even after they've been overwritten. To manage older objects, you can automate transition of objects between more accessible storage classes to less expensive but less accessible classes. You can also schedule object deletion.
  • Understand how to secure your S3 objects.   You can control access through the legacy, bucket, and object‐based ACL rules or by creating more flexible S3 bucket policies or, at the account level, IAM policies. You can also provide temporary access to an object using a presigned URL.
  • Understand how to create a static website.   S3‐based HTML and media files can be exposed as a website that, with the help of Route 53 and CloudFront, can even live behind a DNS domain name and use encrypted HTTPS pages.
  • Understand the differences between S3 and Glacier.   Glacier is meant for inexpensive long‐term storage for data archives that you're unlikely to need often.

Review Questions

  1. Your organization runs Linux‐based EC2 instances that all require low‐latency read/write access to a single set of files. Which of the following AWS services are your best choices? (Choose two.)
    1. AWS Storage Gateway
    2. AWS S3
    3. Amazon Elastic File System
    4. AWS Elastic Block Store
  2. Your organization expects to be storing and processing large volumes of data in many small increments. When considering S3 usability, you'll need to know whether you'll face any practical limitations in the use of AWS account resources. Which of the following will normally be available only in limited amounts?
    1. PUT requests/month against an S3 bucket
    2. The volume of data space available per S3 bucket
    3. Account‐wide S3 storage space
    4. The number of S3 buckets within a single account
  3. You have a publicly available file called filename stored in an S3 bucket named bucketname. Which of the following addresses will successfully retrieve the file using a web browser?
    1. s3.amazonaws.com/bucketname/filename
    2. filename/ bucketname.s3.amazonaws.com
    3. s3://bucketname/filename
    4. s3://filename/bucketname
  4. If you want the files stored in an S3 bucket to be accessible using a familiar directory hierarchy system, you'll need to specify prefixes and delimiters. What are prefixes and delimiters?
    1. A prefix is the name common to the objects you want to group, and a delimiter is the bar character (|).
    2. A prefix is the DNS name that precedes the amazonaws.com domain, and a delimiter is the name you want to give your file directory.
    3. A prefix is the name common to the objects you want to group, and a delimiter is a forward slash character (/).
    4. A prefix is the name common to the file type you want to identify, and a delimiter is a forward slash character (/).
  5. Your web application relies on data objects stored in AWS S3 buckets. Compliance with industry regulations requires that those objects are encrypted and that related events can be closely tracked. Which combination of tools should you use? (Choose two.)
    1. Server‐side encryption
    2. Amazon S3‐Managed Keys
    3. AWS KMS‐Managed Keys
    4. Client‐side encryption
    5. AWS End‐to‐End managed keys
  6. You are engaged in a deep audit of the use of your AWS resources and you need to better understand the structure and content of your S3 server access logs. Which of the following operational details are likely to be included in S3 server access logs? (Choose three.)
    1. Source bucket name
    2. Action requested
    3. Current bucket size
    4. API bucket creation calls
    5. Response status
  7. You're assessing the level of durability you'll need to sufficiently ensure the long‐term viability of a new web application you're planning. Which of the following risks are covered by S3's data durability guaranties? (Choose two.)
    1. User misconfiguration
    2. Account security breach
    3. Infrastructure failure
    4. Temporary service outages
    5. Data center security breach
  8. Which of the following explains the difference in durability between S3's One Zone‐IA and Reduced Redundancy classes?
    1. One Zone‐IA data is heavily replicated but only within a single availability zone, whereas Reduced Redundancy data is only lightly replicated.
    2. Reduced Redundancy data is heavily replicated but only within a single availability zone, whereas One Zone‐IA data is only lightly replicated.
    3. One Zone‐IA data is replicated across AWS regions, whereas Reduced Redundancy data is restricted to a single region.
    4. One Zone‐IA data is automatically backed up to Amazon Glacier, whereas Reduced Redundancy data remains within S3.
  9. Which of the following is the 12‐month availability guarantee for the S3 Standard‐IA class?
    1. 99.99 percent
    2. 99.9 percent
    3. 99.999999999 percent
    4. 99.5 percent
  10. Your application regularly writes data to an S3 bucket, but you're worried about the potential for data corruption as a result of conflicting concurrent operations. Which of the following data operations would not be subject to concerns about eventual consistency?
    1. Operations immediately preceding the deletion of an existing object
    2. Operations subsequent to the updating of an existing object
    3. Operations subsequent to the deletion of an existing object
    4. Operations subsequent to the creation of a new object
  11. You're worried that updates to the important data you store in S3 might incorrectly overwrite existing files. What must you do to protect objects in S3 buckets from being accidentally lost?
    1. Nothing. S3 protects existing files by default.
    2. Nothing. S3 saves older versions of your files by default.
    3. Enable versioning.
    4. Enable file overwrite protection.
  12. Your S3 buckets contain many thousands of objects. Some of them could be moved to less expensive storage classes and others still require instant availability. How can you apply transitions between storage classes for only certain objects within an S3 bucket?
    1. By specifying particular prefixes when you define your lifecycle rules
    2. This isn't possible. Lifecycle rules must apply to all the objects in a bucket.
    3. By specifying particular prefixes when you create the bucket
    4. By importing a predefined lifecycle rule template
  13. Which of the following classes will usually make the most sense for long‐term storage when included within a sequence of lifecycle rules?
    1. Glacier
    2. Reduced Redundancy
    3. S3 One Zone‐IA
    4. S3 Standard‐IA
  14. Which of the following are the recommended methods for providing secure and controlled access to your buckets? (Choose two.)
    1. S3 access control lists (ACLs)
    2. S3 bucket policies
    3. IAM policies
    4. Security groups
    5. AWS Key Management Service
  15. In the context of an S3 bucket policy, which of the following statements describes a principal?
    1. The AWS service being defined (S3 in this case)
    2. An origin resource that's given permission to alter an S3 bucket
    3. The resource whose access is being defined
    4. The user or entity to which access is assigned
  16. You don't want to open up the contents of an S3 bucket to anyone on the Internet, but you need to share the data with specific clients. Generating and then sending them a presigned URL is a perfect solution. Assuming you didn't explicitly set a value, how long will the presigned URL remain valid?
    1. 24 hours
    2. 3,600 seconds
    3. 5 minutes
    4. 360 seconds
  17. Which non‐S3 AWS resources can improve the security and user experience of your S3‐hosted static website? (Choose two.)
    1. AWS Certificate Manager
    2. Elastic Compute Cloud (EC2)
    3. Relational Database Service (RDS)
    4. Route 53
    5. AWS Key Management Service
  18. What is the largest single archive supported by Amazon Glacier?
    1. 5 GB
    2. 40 TB
    3. 5 TB
    4. 40 GB
  19. You need a quick way to transfer very large (peta‐scale) data archives to the cloud. Assuming your Internet connection isn't up to the task, which of the following will be both (relatively) fast and cost‐effective?
    1. Direct Connect
    2. Server Migration Service
    3. Snowball
    4. Storage Gateway
  20. Your organization runs Windows‐based EC2 instances that all require low‐latency read/write access to a single set of files. Which of the following AWS services is your best choice?
    1. Amazon FSx for Windows File Server
    2. Amazon FSx for Lustre
    3. Amazon Elastic File System
    4. Amazon Elastic Block Store