Custom classifiers

Head over to the Glue console. In this section, we'll be using the components under the Data Catalog group.

The first component of our data catalog that we need to create is a classifier. As you can see from the structure of our sample JSON file, the dataset is nested: it contains structs and arrays. This means we need to tell Glue how to read the file by specifying a JSON path; otherwise, each entry of the file will be loaded as a single line, which is not very useful for analysis. Later on, we will see that we can also unnest the records using Athena, but we want to avoid unnecessary work.

For files that don't contain nested structures, this step is not necessary, as Glue's built-in JSON classifier works perfectly well.

If you're not following along yourself, here's a snapshot of the dataset file so that you can get an idea:

Screenshot of a JSON file viewed in Visual Studio Code

Let's follow an example so that we can create a classifier in AWS Glue:

  1. To create a classifier, go to the Glue console, click on Classifiers under Crawlers, and hit the Add classifier button. Complete the definition by using the information shown in the following screenshot. This will create a new custom classifier for our data:

Adding a classifier

The $[*] JSON path tells Glue to treat each element of the file's top-level array as a separate record, so that the fields of each element are converted into columns.
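To get a feel for what the $[*] path selects, here is a minimal sketch in plain Python. It does not call Glue; it only mimics the idea on a small, made-up sample (the field names id, customer, and items are assumptions, not the fields of our dataset). A real crawler also infers data types, which this sketch leaves out.

```python
import json

# Hypothetical sample: a top-level JSON array whose elements hold
# nested structs and arrays. These field names are illustrative only.
raw = """
[
  {"id": 1, "customer": {"name": "Ana"}, "items": ["a", "b"]},
  {"id": 2, "customer": {"name": "Ben"}, "items": ["c"]}
]
"""


def split_top_level_array(text):
    """Mimic the effect of $[*]: one record per element of the top-level array."""
    document = json.loads(text)
    if not isinstance(document, list):
        raise ValueError("expected a top-level JSON array")
    return document


records = split_top_level_array(raw)

# The top-level keys of the records are what end up as columns.
columns = sorted({key for record in records for key in record})

print(len(records))
print(columns)
```

Each element of the array becomes its own record, and nested values such as customer stay nested inside their column rather than being flattened.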

You should now see your newly created classifier, highlighted in the red box in the following screenshot:

List of available classifiers
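If you prefer scripting to clicking through the console, the same classifier can be created with boto3. The following is a sketch rather than the exact setup from the screenshots: the classifier name nested-json-classifier is a hypothetical placeholder, and the helper functions are our own. The create_json_classifier function needs boto3 installed and valid AWS credentials, so it is defined here but not called.

```python
def build_json_classifier_request(name, json_path="$[*]"):
    """Build the keyword arguments for Glue's create_classifier call."""
    return {"JsonClassifier": {"Name": name, "JsonPath": json_path}}


def create_json_classifier(name, json_path="$[*]"):
    """Create the classifier in Glue; needs boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the request can be built without boto3

    glue = boto3.client("glue")
    glue.create_classifier(**build_json_classifier_request(name, json_path))
    # Read the classifier back to confirm that it now exists.
    return glue.get_classifier(Name=name)["Classifier"]


# "nested-json-classifier" is a hypothetical name; use your own.
request = build_json_classifier_request("nested-json-classifier")
print(request)
```

Calling create_json_classifier("nested-json-classifier") against your own account should make the classifier appear in the console list as well.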

You can find more information about Glue classifiers in the Further reading section of this chapter.

In the next section, we will create a database for our data catalog and find out exactly what Glue defines as a database. It's not what you might expect!
