Running a Glue job

So far, we have created a job that knows about our data. We've specified which fields to extract, which structures we want transformed, and where to load the data. Now, we are ready to run the Glue job, so let's learn how to do that by following these steps:

  1. Select your job by clicking on the checkbox next to it. Then, hit Action | Run job:

Actions in the job screen in the Glue console
  2. While your job is running, select the checkbox next to the job name again if it isn't already selected. In the detail panel below, you will see your job running.
  3. Click on the Run ID to get the job run details. You can also check the logs that are being generated and, if you have errors, click on Error Logs to go to CloudWatch and see what happened. A CLI alternative for tailing these logs is sketched after this list.
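If you prefer to follow the logs from a terminal, you can tail the CloudWatch log groups directly. The following is only a sketch and makes a couple of assumptions: it requires AWS CLI v2 (which provides the logs tail command), and it assumes your job writes to the default Glue log groups, /aws-glue/jobs/output and /aws-glue/jobs/error; your setup may use different group names:

aws logs tail /aws-glue/jobs/error --since 1h --follow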

Instead of using the console, you can also run the job by using the following CLI command:

aws glue start-job-run \
--job-name TransformWeatherData \
--timeout 10 \
--max-capacity 2

If your CLI is configured to return responses in JSON format, it will output the job run ID, as shown in the following snippet. You can compare it with the run ID shown in the Glue console:

{
"JobRunId": "jr_cd30b15f97551e83e69484ecda992dc1f9562ea44b53a5b0f77ebb3b3a506022"
}
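
You can also use that run ID to check on the run from the command line instead of the console. The following is a minimal sketch; substitute the JobRunId that your own start-job-run call returned:

aws glue get-job-run \
--job-name TransformWeatherData \
--run-id jr_cd30b15f97551e83e69484ecda992dc1f9562ea44b53a5b0f77ebb3b3a506022

The JobRunState field in the response moves from RUNNING to SUCCEEDED (or FAILED) as the job progresses.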

My job took 5 minutes to run, as shown in the following screenshot. You should see the Run Status update to Succeeded:

Job screen showing the progress of a running job

Just to confirm that some data was written to S3, let's have a look at the bucket the Glue job wrote to.

  1. Head over to the S3 console and look at the target location in your S3 bucket:

The listing of our S3 bucket with the transformed parquet objects

Hundreds of Parquet files were generated and written to our bucket. If you try to download one of these files and open it in a text editor, it won't work, as Parquet is a binary format; what we need is a tool such as Amazon Athena to read these files and run some analytical queries over them.
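If you want to confirm the output from the command line rather than the console, you can list the objects under the target prefix. This is just a sketch; the bucket name and prefix are placeholders, so replace them with your own target location:

aws s3 ls s3://your-bucket/weather/transformed/ \
--recursive --human-readable --summarize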

But before we move on, we need to update our Data Catalog again so that Athena can read the schema and content of the Parquet files in our bucket, that is, so that the new dataset containing the transformed data is included. You have already learned how to create a crawler, so go ahead and create a new one called tfdDataWeather. You can either create a new database in the Data Catalog or reuse the same weather database; since the paths for the raw and transformed datasets are different, separate tables will be created, so there's no problem in using the same database as the one used for the raw data. Once the new crawler is created, don't forget to run it. If you prefer the CLI, a sketch of the equivalent commands follows.
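The crawler can also be created and started from the CLI. The following is only a sketch: the IAM role name and the S3 path are placeholders for your own values, and it assumes you are reusing the existing weather database:

aws glue create-crawler \
--name tfdDataWeather \
--role AWSGlueServiceRole-Weather \
--database-name weather \
--targets '{"S3Targets": [{"Path": "s3://your-bucket/weather/transformed/"}]}'

aws glue start-crawler --name tfdDataWeather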

Excellent! In this section, we have learned how to create and run an entire ETL process using AWS Glue. That's pretty awesome.

Now that we have some newly transformed data, let's go and analyze it. Move on to the next section when you're ready.
