Where do I start?

Start with batch processing. Always start with batch! To build our first batch workload, we will use a combination of the following services:

  • S3: To process data in batch mode, we need our raw data stored in a place where it is easily accessible to the services we are going to use. For example, the transformation tool, Glue, must be able to pick up the raw files and write them back once the transformations are done.
  • Glue: This is the batch transformation tool we will use to create the transformation scripts, schedule the job, and catalog the dataset we are going to create. We also need to create crawlers that will scan the files in S3 and identify the schema (the columns and data types) the files contain; a sketch of setting one up follows this list.
  • Athena: To query the datasets we store in S3 and catalog in Glue, we can use Amazon Athena, a serverless query tool that lets you run SQL against files, even though the data is not stored in a relational database. Athena leverages the data catalog that is created and populated by Glue. We'll explore this in the next section, once we have our data catalog.
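As a rough sketch of what the Glue side looks like, the boto3 calls below create and start a crawler over an S3 prefix. This is not the book's exact setup: the crawler name, bucket, IAM role ARN, and database name are all placeholders, and it assumes you already have an IAM role with Glue and S3 permissions.

```python
import boto3

glue = boto3.client("glue")

# All names below are placeholders; substitute your own bucket,
# role ARN, database, and crawler name.
glue.create_crawler(
    Name="weather-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",  # hypothetical role
    DatabaseName="weather_db",
    # The crawler scans this prefix and infers the columns and data types.
    Targets={"S3Targets": [{"Path": "s3://my-weather-bucket/raw/weather/"}]},
)

# Run the crawler once; it writes the inferred table to the Data Catalog.
glue.start_crawler(Name="weather-raw-crawler")
```

Once the crawler finishes, the inferred table appears in the Glue Data Catalog, which is exactly what Athena will read when we query the files later.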

For this example, we will use some files containing meteorology data in JSON format. The sample data was collected from the https://openweathermap.org/ website, which makes weather data openly available. There is a plethora of websites, such as https://www.kaggle.com/ and https://www.data.govt.nz/about/open-data-nz/, that offer hundreds of open datasets if you want to explore other options. Once we have the files in an S3 bucket, we need to understand the structure and content of the data; a small upload sketch follows.
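To make that concrete, here is a minimal sketch of landing one such record in S3 with boto3. The record shape is only illustrative, loosely modeled on an OpenWeatherMap current-weather response, and the bucket and key are placeholders rather than values from this chapter.

```python
import json
import boto3

# Illustrative record, loosely modeled on an OpenWeatherMap
# current-weather response; your real files may have other fields.
record = {
    "name": "Wellington",
    "main": {"temp": 287.15, "humidity": 82, "pressure": 1013},
    "weather": [{"main": "Rain", "description": "light rain"}],
    "dt": 1672531200,
}

# Upload the raw JSON so Glue can crawl and transform it later.
# Bucket and key are placeholders; use your own bucket name.
s3 = boto3.client("s3")
s3.put_object(
    Bucket="my-weather-bucket",
    Key="raw/weather/year=2023/month=01/day=01/wellington.json",
    Body=json.dumps(record),
)
```

Using Hive-style key prefixes such as year=2023/month=01 is a common convention: a Glue crawler can pick them up as partitions, which later lets Athena scan only the dates a query actually needs.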

Now that we know how to approach building our first workload, let's have a look at what we will build at a high level.
