Data storage using S3

Loading files into S3 can be achieved in multiple ways—you can get creative depending on the requirements, data volume, file update frequency, and other variables. A few possible approaches are as follows:

  • A custom shell script (or a set of scripts) that runs AWS CLI commands to send the files to an S3 bucket
  • The AWS SDK for your favorite language
  • AWS DataSync
  • AWS Storage Gateway
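
The first option, a custom shell script wrapping the AWS CLI, could be as simple as the sketch below. The bucket name and data directory are hypothetical placeholders, and the script only prints the command it would run so you can review it before uncommenting the final line:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Placeholder values -- substitute your own bucket name and local folder.
BUCKET="raw-data-bucket-name"
DATA_DIR="${1:-./data}"

# Build the sync command that would mirror the local folder into the bucket.
CMD="aws s3 sync $DATA_DIR s3://$BUCKET/"
echo "Would run: $CMD"

# Uncomment to actually perform the upload:
# $CMD
```

A script like this is easy to drop into cron or a CI job once you are happy with the command it prints.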

We're going to use the AWS CLI directly from the command line to upload the data file to a newly created bucket:

  1. Let's create a bucket that will receive our raw data. S3 bucket names must be globally unique, so update the following command with your own unique bucket name:
aws s3 mb s3://[raw-data-bucket-name]/
  2. Now that the bucket is ready, let's send the file from the local machine to S3. We want to copy the weather-dataset.json file into our raw data bucket. Here's an AWS CLI command that will get the job done:
aws s3 cp ~/[path-to-downloaded-file]/weather-dataset.json s3://[raw-data-bucket-name]/
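
Once the command finishes, you can confirm the object landed in the bucket (the bucket name below is a placeholder, and an AWS CLI configured with valid credentials is assumed):

```shell
# List the bucket contents to confirm the upload (placeholder bucket name).
aws s3 ls s3://raw-data-bucket-name/

# Or inspect the object's metadata (size, ETag, content type) directly:
aws s3api head-object --bucket raw-data-bucket-name --key weather-dataset.json
```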

With the data file in S3, we can move on to AWS Glue and start the data processing step. The first topic to cover with Glue is how to create the data catalog, so let's find out what that is in the next section.
