Creating a Glue job

Okay—it's time to create our first Glue job:

  1. On the AWS console, go to the Glue console and on the left-hand side menu, under ETL, click on Jobs.
  2. Then, hit the Add Job button. The Add Job step-by-step configuration process will start and you can set up the ETL process. Ideally, you will give it a name that describes what the job will do. In this case, I'll call it TransformWeatherData and will use the same IAM role I've used for the crawler—make sure yours has the necessary permissions for the new location. The following screenshot shows an example of my configuration, but yours will have different S3 locations:

Creating a Glue job
  3. Leave the other fields at their default values, including A proposed script generated by AWS Glue, which will give us a baseline script that we can customize with transformations if required, or simply use to move data from one dataset to another.
  4. Expand the Security configuration, script libraries, and job parameters (optional) section and, under Maximum capacity, change the default value from 10 to 2, then set the Job timeout to 10 minutes. We are processing a small amount of data, so that should be more than enough. The following is a sensible configuration for our job parameters:

Glue job properties

AWS Glue uses Data Processing Units (DPUs) that are allocated when a job starts. The more DPUs a job uses, the more parallelism will be applied and, potentially, the faster the job will run. Each ETL job should be evaluated and tuned so that it runs as fast as possible at the lowest possible cost.

Allocating DPUs costs money. When you run a Glue job, you pay for the number of DPUs allocated and for the time the job runs, billed with a minimum of 10 minutes and in 1-minute increments after that. This means that if your job finishes in under 10 minutes anyway, you should ideally allocate as few DPUs as possible. Glue requires at least 2 DPUs to run any given job.
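
If you prefer to script this setup rather than click through the console wizard, the following is a minimal sketch, using boto3, of how a job with the same capacity and timeout settings could be created. The IAM role name and script location shown here are assumptions for illustration, not values from this walkthrough:

import boto3

glue = boto3.client("glue")

# A minimal sketch of creating a comparable job programmatically.
# The role and ScriptLocation below are assumed placeholders.
glue.create_job(
    Name="TransformWeatherData",
    Role="GlueServiceRole-Weather",  # an IAM role with access to both buckets
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/TransformWeatherData.py",
        "PythonVersion": "3",
    },
    MaxCapacity=2.0,  # 2 DPUs, the minimum Glue accepts for an ETL job
    Timeout=10,       # minutes, matching the console setting above
)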

  5. Hit Next.
  6. Choose your data source. We currently have our raw data in an S3 bucket; what we want to do is read that raw data, apply some transformations, and write the results to another bucket, one that will hold the transformed data. In our example, we want to select raw as the data source and click Next (a sketch of how this selection typically appears in the generated script follows the screenshot). The following screenshot highlights our raw data source:

Choosing a data source
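
In the auto-generated script, this source selection typically becomes a read from the Data Catalog. The following is a rough sketch; the database name weather and the table name raw are assumptions based on how the crawler in this example was set up:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the crawled "raw" table from the Data Catalog as a DynamicFrame.
# The database name is an assumption for this example.
raw_data = glue_context.create_dynamic_frame.from_catalog(
    database="weather",
    table_name="raw",
)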
  7. Choose a transform type. Select the Change schema option as we are moving data from one schema to another and click Next. The following screenshot shows the options that are available:

Adding transformation options

  8. Choose a data target. We need to choose where the data will be written after it's been transformed by Glue. Since we don't have an existing target database, such as an RDS Aurora or a Redshift database, we are going to create a new dataset on S3 when the job runs. The next time we run the same job, the data will be appended unless you explicitly delete the target dataset in your transformation script.
  9. Select Amazon S3 from the Data store drop-down menu and set the output Format to Parquet, given that we want this dataset to be optimized for analytical queries. Other formats are also supported, such as ORC (another columnar format) and Avro, as you can see in the Format drop-down menu. Common formats such as CSV and JSON are not ideal for analytics workloads; we will see more of this when we work with queries on Amazon Athena. A sketch of how the corresponding write step might look in the generated script follows the screenshot. The following screenshot shows the options that are available, including where to select our target file format:

Choosing a data target
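
In the generated script, the target chosen here corresponds roughly to the write step sketched below. The output path is an assumption for this example, and the commented-out purge call shows one way to explicitly clear the target before writing, since Glue will otherwise keep adding files on every run:

from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Read the raw table again (same assumptions as the earlier sketch).
raw_data = glue_context.create_dynamic_frame.from_catalog(
    database="weather", table_name="raw"
)

# The output bucket and prefix are assumptions for this example.
output_path = "s3://my-transformed-weather-data/weather/"

# Optional: clear the target first so that reruns don't keep appending files.
# glue_context.purge_s3_path(output_path, options={"retentionPeriod": 0})

# Write the (eventually transformed) DynamicFrame to S3 as Parquet.
glue_context.write_dynamic_frame.from_options(
    frame=raw_data,
    connection_type="s3",
    connection_options={"path": output_path},
    format="parquet",
)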
  10. Now, we need to map the source columns to target columns. As you can observe on the source side (left), we have a few different data types coming from our JSON file. We have one string, one double, and one int column, all of which are clearly defined. We also have one array and one struct, which are more complex data types, given that they have nested objects within them. The following screenshot shows the fields from my raw data; yours might be different if you are using a different dataset:

Raw data structure against the target structure

When we expand city, we can see there's another struct level called coord, which finally has the latitude and longitude (of the double type) of the city. Have a look at the following screenshot to see what I mean:

Nested structures in our raw data

The nested structures under city are not mapped by default on the Glue console. This means that if we leave the mapping as is, city data will be loaded into the target as a nested structure, just like the source, which will make our queries more verbose and our dataset slightly harder to work with. To avoid that annoyance, we will flatten out the structure of the city data and select only the fields that will be useful to our analysis, not all of them.

  11. To do that, we need to remove the mapping for the city field by clicking on the x next to the Data type on the Target side (right) for all the columns under city (inclusive) until your mapping looks as follows:

Changes to the target data structure
  12. Now, we need to map the data as per our requirements. Click on Add Column at the top-right corner and create a new column on the target called cityid of the int type.
  13. Go to the id column under city in the source and click on the row. A drop-down menu will allow you to select which column will receive the value of the id column. Choose the one you just created, cityid, and hit Save. The following screenshot shows you what to expect when you click the drop-down menu:

Selecting the mapping between raw and target fields

A new arrow from the Source column to the Target column will appear and the ID of the city will be loaded to the target as a column by itself rather than as part of a nested structure.

  14. Repeat this process by adding the columns listed here and creating the appropriate mappings. Don't panic if these fields aren't the same as yours; simply map your own fields to their corresponding targets:
cityname
country
population
sunset
sunrise
latitude
longitude

Once you've finished the mapping configuration, it should look something like this. The following shows how we are mapping a bunch of nested structures to a flat target structure:

Our targeted structure for transformation
  15. Now, you can click on Save Job and Edit Script. The script editor will show you the generated code that you can now customize.
  16. For now, we won't change the content of the script; all we want to do is flatten out the city data structure, as we configured in the mapping in the previous step.

If you want to see where that transformation is implemented in the script, it's usually done with the ApplyMapping class; in this case, that's line 25 of our auto-generated transformation script. If you do some side-scrolling, you will find the mappings for the columns that come from the city struct (a simplified sketch of such a mapping call follows the screenshot). Close the script editor by clicking on the X at the top-right corner of the editor:

AWS Glue script editor
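
The exact mappings in your script will reflect your own dataset, but the ApplyMapping call that flattens the city struct looks roughly like the sketch below. The field names mirror the columns mapped above, while the data types are assumptions for illustration:

from awsglue.transforms import ApplyMapping

# A simplified sketch of the auto-generated mapping step. Each tuple is
# (source field, source type, target field, target type); dotted names such as
# "city.coord.lat" pull values out of the nested struct so the target ends up flat.
flattened = ApplyMapping.apply(
    frame=raw_data,  # the DynamicFrame read from the raw table earlier
    mappings=[
        ("city.id", "int", "cityid", "int"),
        ("city.name", "string", "cityname", "string"),
        ("city.country", "string", "country", "string"),
        ("city.population", "int", "population", "int"),
        ("city.sunset", "int", "sunset", "int"),
        ("city.sunrise", "int", "sunrise", "int"),
        ("city.coord.lat", "double", "latitude", "double"),
        ("city.coord.lon", "double", "longitude", "double"),
    ],
)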
As you will have seen when creating the Glue job, Glue scripts are stored in S3 buckets. That means you have all of the flexibility of S3: you can use any IDE to develop your transformation scripts in Python or Scala and upload them to S3. Setting up a CI/CD pipeline and automating the testing of your scripts is also highly recommended.
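
As a small illustration of that workflow, a locally developed and tested script could be uploaded over the job's script location with something like the following sketch; the filename, bucket, and key here are assumptions and should match whatever ScriptLocation your job points at:

import boto3

s3 = boto3.client("s3")

# Upload a locally developed script over the job's ScriptLocation.
# The filename, bucket, and key below are assumed placeholders.
s3.upload_file(
    Filename="transform_weather_data.py",
    Bucket="my-glue-scripts",
    Key="TransformWeatherData.py",
)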

Once you are happy with the transformation script and all of the settings of the job that has just been created, feel free to go ahead and jump to the next section to learn how you can run your newly created Glue job.
