This chapter is the one I have been looking forward to the most. Although all the chapters have been very enjoyable to write, setting the right tone and content is technically challenging. This one is set to stretch our boundaries the most.
While writing this chapter, some differences were observed between the Snowflake documentation and AWS Management Console layout; hence some sections have textual descriptions.
What is semi-structured and unstructured data?
Semi-structured data is typically a document format where all the relationships are described in the document. In other words, a single document containing semi-structured data is self-contained, self-describing, and not reliant upon external references to be complete. While the content is usually human-readable, we often use external tools to display the content in an easily understood format. Don’t worry. All will become clear when we dive into one of the semi-structured formats (JSON).
Unstructured data is all around us. We rarely think of our daily interaction with media in general as unstructured data, but it is. We use unstructured data whenever we watch TV, read a book, or open an email. Learning how to interact with the emerging—and for some of us, the very exciting—realm of unstructured data is where the “art of the possible” has become a reality. This chapter demonstrates how to load images into Snowflake, allowing programmatic content interrogation. I can’t wait to get into this part. I never thought I would get excited over invoices.
There are plenty of books available on structured data. Instead, we start with semi-structured data, JSON, and other familiar formats. I hope to offer fresh insights, tools, and techniques to address content before moving on to unstructured data, where the various data formats bring their own challenges. I offer ways to extract the content into Snowflake, taking advantage of the latest (at the time of writing) features and providing a framework for extending unstructured support according to your needs.
As Saurin Shah, a Snowflake product manager, says, “It doesn’t matter how we store the data. It matters how data is stored and consumed.” Integrating semi-structured and unstructured data into Snowflake has significant potential to unlock and transform value for our organizations, particularly when enriching with our relational data. Imagine the possibilities that become a reality if we can readily access content locked up inside our documents, PDFs, and images of various formats and then add real-time streaming data.
Semi-Structured Data
This chapter focuses on JSON. Snowflake supplies a tutorial at https://docs.snowflake.com/en/user-guide/json-basics-tutorial.html. Snowflake also supports other semi-structured file formats, including Avro, ORC, Parquet, and, to a lesser degree, XML. Returning to JSON, you may find the tool at https://jsonformatter.org useful.
AWS allows a single SQS queue on the S3 bucket. Using AWS Management Console, create a new S3 bucket, btsdc-json-bucket, copying settings from the existing btsdc-test-bucket.
STORAGE_AWS_IAM_USER_ARN: arn:aws:iam::321333230101:user/vnki-s-ukst5070
STORAGE_AWS_EXTERNAL_ID: GH06274_SFCRole=3_Vf/LKebJXaqzvO+unFpnT5OzqGM=
In AWS Management Console, we must edit our role trust relationship to add the preceding STORAGE_AWS_IAM_USER_ARN and STORAGE_AWS_EXTERNAL_ID values. Navigate to IAM ➤ Roles. In the drop-down list, select S3, and then “Allow S3 to call AWS services on your behalf,” and click Next.
ENCRYPTION addresses where in the data upload process our files are encrypted. The documentation notes that we should specify server-side encryption (SNOWFLAKE_SSE) for unstructured data files; see https://docs.snowflake.com/en/sql-reference/sql/create-stage.html#internal-stage-parameters-internalstageparams.
Then save the changes.
File Format
In previous chapters, we encountered the FILE FORMAT object, but we did not pay much attention to it. We used simple declarations to skip the first record, since this is typically a header record used to convey attribute names that already exist in our staging tables.
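As a reminder, a minimal declaration might look like the following sketch; the format name is hypothetical.

```sql
-- Skip the first (header) record when loading CSV files
CREATE OR REPLACE FILE FORMAT my_csv_format
  TYPE        = CSV
  SKIP_HEADER = 1;
```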
When writing this book, XML is only supported as a Public Preview feature.
Please also refer to Snowflake documentation at https://docs.snowflake.com/en/sql-reference/sql/create-file-format.html#create-file-format .
The order of precedence for a FILE FORMAT application varies according to where FILE FORMAT is declared. Please refer to https://docs.snowflake.com/en/user-guide/data-load-prepare.html#overriding-default-file-format-options .
JSON
JSON is shorthand for JavaScript Object Notation, a human-readable text format for storing and transporting information. JSON content is self-describing and, due to the nested nature of relationships in parent-child form, easy to understand.
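To make this concrete, here is a small, hypothetical JSON document parsed with Python’s standard json module; the field names are illustrative only.

```python
import json

# A small, hypothetical JSON document illustrating nested objects and arrays
raw = """
{
  "employee": {
    "name": "Jane Doe",
    "roles": ["analyst", "developer"],
    "contact": {"email": "jane@example.com"}
  }
}
"""

doc = json.loads(raw)                # parse the self-describing text

print(doc["employee"]["name"])       # nested object access → Jane Doe
print(doc["employee"]["roles"][0])   # array access → analyst
```

Note how the document carries its own structure: no external schema is needed to navigate it.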
We can immediately read the JSON structure and recognize meaningful information, but what is less clear (this example has deliberately been structured to illustrate the point) is the built-in structured nature of JSON. To view it in a more readable format, navigate to https://jsonformatter.org and type or paste the content from the accompanying file into the browser, where a cleaner, easier-to-read version is presented in the right-hand pane.
Please experiment with changing the content and then “validate” your changes. Note the use of curly braces ({ }) that hold objects resulting in key/value pairs and square brackets ([ ]) that hold lists and arrays. These concepts are important to grasp for later use in materialized views explained next.
Variant Data Type
Now that you have a basic grasp of JSON, let’s prepare a file containing the preceding or reformatted content from https://jsonformatter.org and load it into S3. For this demonstration, using your favorite text editor, please create json_test.json and upload it into S3.
With our sample file json_test.json uploaded to S3, we can view its contents.
For troubleshooting external stage issues, please see Chapter 7, where a step-by-step configuration guide is available.
For our semi-structured data, we use the VARIANT data type. There are others, including OBJECT and ARRAY. All are described at https://docs.snowflake.com/en/sql-reference/data-types-semistructured.html#semi-structured-data-types.
VARIANT is limited to 16 MB in size.
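A minimal sketch of staging and loading the file follows; the table, stage, and file names are assumptions for illustration.

```sql
-- A single-column staging table holding the raw JSON document
CREATE OR REPLACE TABLE json_staging (v VARIANT);

-- Load the file from a (hypothetical) external stage
COPY INTO json_staging
FROM @my_stage/json_test.json
FILE_FORMAT = (TYPE = JSON);
```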
Handling Large Files
What if our JSON file exceeds the maximum variant size of 16 MB? What options do we have to load larger files?
However, this approach of stripping the outer array (the STRIP_OUTER_ARRAY file format option) does not work for JSON without an outer array to strip. We need to consider other options.
While researching this chapter, I came across a technique worthy of consideration. I have not tested or endorsed the next approach. I simply offer it as a novel way forward; see https://medium.com/snowflake/breaking-the-16-mb-limit-kinda-fe8d36658b .
Other approaches to splitting large JSON files include utilizing an ELT tool, which may offer built-in capabilities, or developing a custom process using a third-party tool to pre-process JSON into manageable chunks, as jq purports to do; see https://stedolan.github.io/jq/ . Also, consider using an external table to access your JSON file and pre-processing data before loading it into Snowflake.
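One simple pre-processing approach can be sketched in Python. This is a hedged illustration, not a production tool: it assumes the source document is a JSON array and fits in the memory of the machine doing the splitting (only Snowflake’s 16 MB VARIANT limit is the constraint, not machine memory).

```python
import json

def split_json_array(text, chunk_size):
    """Split one large JSON array into several smaller JSON array documents.

    Each returned string is itself valid JSON and can be loaded into a
    VARIANT column individually, sidestepping the 16 MB per-value limit.
    """
    records = json.loads(text)  # the outer structure must be an array
    return [
        json.dumps(records[i:i + chunk_size])
        for i in range(0, len(records), chunk_size)
    ]

# Toy example: three records split into chunks of two
chunks = split_json_array('[{"id": 1}, {"id": 2}, {"id": 3}]', chunk_size=2)
```

In practice, each chunk would be written out as its own file before upload to S3.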
Regardless of the approach chosen, remember that the S3 bucket where your JSON file lands could, via SQS, invoke a Lambda function to pre-process the file and then copy it to the S3 bucket where Snowpipe or other auto-ingestion processing occurs. Unfortunately, a fuller investigation is beyond the scope of this book.
Materialized Views
Having successfully landed our test file in our staging table, we must consider how to access the JSON content. We now discuss the merits of directly accessing JSON content versus creating materialized views that overlay the JSON structure.
We aim to normalize the JSON structure into a relational format to facilitate joining with our structured data content. We may choose to implement a suite of materialized views in third normal form, extending the pattern explained here, which stops at a single denormalized materialized view.
LATERAL (documentation at https://docs.snowflake.com/en/sql-reference/constructs/join-lateral.html )
FLATTEN() (documentation at https://docs.snowflake.com/en/sql-reference/functions/flatten.html#flatten )
Nested records in square brackets should be flattened using LATERAL FLATTEN().
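A sketch of the pattern follows, assuming a staging table json_staging with a VARIANT column v and a hypothetical employee.roles array.

```sql
-- Unnest the roles array: one output row per array element
SELECT t.v:employee.name::STRING AS employee_name,
       r.value::STRING           AS role
FROM   json_staging t,
       LATERAL FLATTEN(input => t.v:employee.roles) r;
```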
Direct Query or Materialized View
Now that we have built code to perform both direct query access against our staged data and a materialized view, we should consider the relative merits of each approach.
With direct query access, we can easily address new elements as they appear in the source JSON document, but at the cost of explicit attribute referencing. As you can see, the syntax is somewhat convoluted and labor intensive to implement, it does not lend itself to integration with other Snowflake objects, and because JSON does not distinguish rich data types, dates, timestamps, and numbers are treated as strings. The advantages of direct query access are no additional disk storage charges and end-user access to the raw JSON.
Abstraction from the JSON pathing by renaming attributes into more user-friendly conventions
Improved performance when accessing deeply buried information in the JSON
Cleaner syntax when integrating with other Snowflake objects, plus the possibility of micro-partition pruning improving performance where micro-partitions and a cluster key are defined
Materialized views can be used for summarizations and aggregations and are subject to RBAC
Conversion from JSON date and timestamp strings into their corresponding Snowflake data type provides an opportunity for performant temporal queries.
Materialized views can be used for query rewrite even if not referenced in the SQL
The general use cases for direct query access are when results change often, results are not used often, and queries are not resource-intensive. The general use cases for materialized views are when results don’t change often, results are often used, and queries are resource intensive.
In summary, our preferred approach is to retain our raw JSON and build materialized views on top, providing the best of both approaches but at the cost of additional (cheap) storage for the materialized view. Note that materialized views can also be used on external tables.
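A minimal sketch of this preferred pattern follows, assuming a staging table json_staging with a VARIANT column v; the attribute paths and view name are hypothetical.

```sql
-- Overlay the raw JSON with a relational, strongly typed projection
CREATE OR REPLACE MATERIALIZED VIEW employee_mv AS
SELECT v:employee.name::STRING                      AS employee_name,
       v:employee.contact.email::STRING             AS email,
       -- convert a JSON date string into a native DATE for temporal queries
       TRY_TO_DATE(v:employee.start_date::STRING)   AS start_date
FROM   json_staging;
```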
MV Limitations
Serverless compute is used for maintenance and therefore incurs both storage and compute costs for automatic management, as the materialized view stores the results of its declaring query. Costs are a function of the amount of data that changes in each base table and the number of materialized views created on each table, with compute costs calculated in 1-second increments.
The Time Travel feature is not supported in materialized views.
Automation
In Chapter 8, we implemented Snowpipe, recognizing that materialized views by their nature are limited to the current version of data only. We may wish to historize ingested JSON documents.
However, we can do more with a stage directory where Snowflake supplies several file functions documented at https://docs.snowflake.com/en/sql-reference/functions-file.html#file-functions .
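For example, with a directory table enabled on the stage, we can list files and build URLs to them directly in SQL; the stage name here is a hypothetical placeholder.

```sql
-- Enable and refresh the stage directory table
ALTER STAGE my_stage SET DIRECTORY = (ENABLE = TRUE);
ALTER STAGE my_stage REFRESH;

-- List staged files with a scoped URL for each
SELECT relative_path,
       size,
       BUILD_SCOPED_FILE_URL(@my_stage, relative_path) AS scoped_url
FROM   DIRECTORY(@my_stage);
```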
Note the task has not resumed because the stored procedure has not been declared; this is an exercise for you to complete.
Unstructured Data
If using the AWS Free Tier, be aware that Textract usage is limited to 100 invocations before charges apply.
Once a JSON file has landed in S3, we can treat the content as semi-structured data and ingest it according to the preceding pattern.
If our file exceeds 16 MB, we may also pre-process it to split into multiple source files or strip the outer array according to need.
File Preparation
In AWS Management Console, navigate to S3 and create a new S3 bucket: btsdc-ingest-bucket. We require a second bucket to prevent circular writes when the Lambda function is triggered.
For testing purposes, we use a simple PNG file created from a screenshot of an invoice.
We will later upload the test files into btsdc-ingest-bucket. The Lambda function is triggered, and files are copied to btsdc-test-bucket.
Lambda
We now focus on creating a Lambda function configured using our AWS account.
This is test code and test configuration. It should not be relied on for production implementation.
A basic Lambda function is available in the Code source ➤ lambda_function tab. Note that the Deploy button is grayed out. Whenever we change the Python code, the Deploy button is enabled.
If an error is received during your testing, it may be due to prior event registration. Navigate to S3 ➤ Properties ➤ Event notification and remove any events from file_to_json, and then retry.
Triggers can be viewed under the Configuration ➤ Triggers tab.
The execution role can be viewed under the Configuration ➤ Execution role tab, which needs amending. Click Edit and select the “Create a new role from AWS policy templates” option. Enter a name: S3-ingest-to-S3-stage. Then select “Amazon S3 object read-only permissions” from the drop-down list. The screen automatically returns to the Lambda summary screen.
This URL navigates directly to the page to edit; see https://console.aws.amazon.com/iamv2/home?#/roles/details/S3-ingest-to-S3-stage?section=permissions .
Alternatively, navigate to Lambda ➤ Functions ➤ file_to_json ➤ Configuration ➤ Execution role.
Your organization’s security profile may restrict access; do not assume the following suggested policy meets your requirements.
Click “Review policy” and then the “Save changes” button, which returns to the Lambda function overview.
This code copies a file from the source S3 bucket into the destination bucket. Cut and paste it into the Code tab to test if the policy works correctly.
destination_bucket_name must be changed, as S3 bucket names are global in scope across all AWS accounts.
The following code is intended to be cut and pasted. Python is indentation sensitive, and the printed code format incorrectly implies a line break after the “# Specify source bucket” and “# Write copy statement” comments.
The AWS Lambda console implements alignment markers that must be adhered to; otherwise, an indentation error occurs.
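The copy function can be sketched as follows. This is a hedged reconstruction, not the book’s exact listing: the bucket name and event-parsing helper are illustrative, and boto3 is only available inside the AWS Lambda runtime.

```python
import json
import urllib.parse

DESTINATION_BUCKET = "btsdc-test-bucket"  # change: bucket names are globally unique

def parse_s3_event(event):
    """Extract the source bucket and object key from an S3 event notification."""
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    # Object keys arrive URL-encoded (e.g., spaces become '+')
    key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
    return bucket, key

def lambda_handler(event, context):
    # boto3 ships with the AWS Lambda runtime; imported here so the
    # parsing logic above remains testable outside AWS.
    import boto3
    s3 = boto3.client("s3")
    # Specify source bucket
    source_bucket, key = parse_s3_event(event)
    # Write copy statement
    s3.copy_object(
        Bucket=DESTINATION_BUCKET,
        Key=key,
        CopySource={"Bucket": source_bucket, "Key": key},
    )
    return {"statusCode": 200, "body": json.dumps(f"Copied {key}")}
```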
Then click Deploy.
Upload files to btsdc-ingest-bucket and check they appear in btsdc-test-bucket. We may also check the file has been copied across correctly by downloading it from the target S3 bucket.
Delete files from both btsdc-ingest-bucket and btsdc-test-bucket.
Extracting File Content
This section has the potential to expand into a separate book itself. One challenge with authoring code samples is keeping content to manageable sections while conveying enough information to inform and equip the audience. Cue one epic fail!
Let’s start by investigating Amazon Textract, a file processing utility. Its documentation is at https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.analyze_document. Note that Textract has two modes of operation: synchronous for small files and asynchronous for larger files. In this chapter, we investigate synchronous operation using textract.analyze_document (documentation at https://docs.aws.amazon.com/textract/latest/dg/analyzing-document-text.html) while integrating with our S3 copy code.
I confess. I did not author all of the following code. It is based upon code supplied and used with the kind permission of Saurin Shah of Snowflake Inc., extended to showcase the “art of the possible” with apologies for removing comments and blank lines to reduce the page count. The accompanying file has a complete version.
Mapping to JSON
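The core of the mapping can be sketched as follows. This is a simplified, hedged illustration of how Textract’s AnalyzeDocument response might be walked: KEY_VALUE_SET blocks reference their child WORD blocks and their VALUE partner via Relationships. It is not the complete code from the accompanying file.

```python
def textract_key_values(blocks):
    """Map Textract AnalyzeDocument 'Blocks' into a {key: value} dict (sketch)."""
    by_id = {b["Id"]: b for b in blocks}

    def text_of(block):
        # Concatenate the text of all child WORD blocks
        words = []
        for rel in block.get("Relationships", []):
            if rel["Type"] == "CHILD":
                for cid in rel["Ids"]:
                    child = by_id[cid]
                    if child["BlockType"] == "WORD":
                        words.append(child["Text"])
        return " ".join(words)

    result = {}
    for b in blocks:
        # KEY blocks point at their partner VALUE block
        if b["BlockType"] == "KEY_VALUE_SET" and "KEY" in b.get("EntityTypes", []):
            value_text = ""
            for rel in b.get("Relationships", []):
                if rel["Type"] == "VALUE":
                    for vid in rel["Ids"]:
                        value_text = text_of(by_id[vid])
            result[text_of(b)] = value_text
    return result
```

The resulting dictionary serializes naturally to the JSON document we stage into Snowflake.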
There is one caveat. During testing, we encountered the following error.
Invalid UTF8 detected in string ‘ “0xA3 ”: “”,’ File ‘test_json.png.json’, line 38, character 1, row 37, column “TRANSIENT_STAGE_TABLE”[“$1”:1].
The fix was to delete the offending line (38) in the JSON file and then reload and retry. You may also find it necessary to load your file into https://jsonformatter.org and save the output before uploading.
With data accessible, I leave it to you for further processing.
Troubleshooting
SQL Integration
The previous section demonstrated how to automatically process files on receipt and convert them to JSON, and this chapter has built all the required Snowflake capability to stage and interact with JSON. Let’s now focus on directly accessing the content of files held on S3 using SQL. To do so, we must create an external function to call a remote service, in our example an AWS Python Lambda, using an API integration, for which documentation is at https://docs.snowflake.com/en/sql-reference/sql/create-api-integration.html#for-amazon-api-gateway and https://docs.snowflake.com/en/sql-reference/external-functions-creating-aws.html. Note the very helpful embedded YouTube tutorial.
Tracking Template
We populate the following template copied from https://docs.snowflake.com/en/sql-reference/external-functions-creating-aws-planning.html#preparing-to-use-the-aws-management-console . You may also wish to add this template to your support documentation for each external function created.
Using the template saves you time and avoids troubleshooting and rework. I found out the hard way!
Create a Python Lambda Function
Repeating the steps from Figure 10-9 and Figure 10-10, create a new Lambda function called parse_image. Select Python 3.9 as the runtime, then leave the default execution role as “Create a new role with basic Lambda permissions” before clicking the “Create function” button. After a minute or so, our new function appears similar to Figure 10-11.
You should see “Execution result: succeeded.”
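A minimal “hello world” handler compatible with Snowflake’s external function protocol might look like the following sketch. The row format is the documented Snowflake convention: the request body is {"data": [[row_no, arg1, ...], ...]} and the response must echo the row numbers back; the echo behavior itself is my illustrative choice.

```python
import json

def lambda_handler(event, context):
    """Echo each input row back to Snowflake.

    Snowflake POSTs {"data": [[row_no, arg1, ...], ...]} and expects
    {"data": [[row_no, result], ...]} in the response body.
    """
    payload = json.loads(event["body"])
    # One result per input row, preserving the row number in position 0
    output = [[row[0], row[1:]] for row in payload["data"]]
    return {"statusCode": 200, "body": json.dumps({"data": output})}
```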
Create AWS IAM Role
We now use AWS Management Console to configure the IAM role for Snowflake’s use. Click Roles ➤ Create role. Our role is for another AWS account acting as a trusted entity. Select “AWS account” and use your account number recorded in the tracking template (mine is 616701129608). Then click Next ➤ Next. Add a meaningful name such as Snowflake_External_Function, and then click Create role.
Once created, click the new role and record the ARN. (Mine is arn:aws:iam::616701129608:role/Snowflake_External_Function.)
Get Snowflake VPC Information
From the output, we add privatelink-vpce-id: com.amazonaws.vpce.eu-west-2.vpce-svc-0839061a5300e5ac1 to our tracking template. Note that yours will differ.
Create API Gateway Endpoint
In AWS Management Console, search for “API Gateway” and select the service. From the options presented, find REST API and click the Build button. Then select New API, at which point you see the Settings screen.
Enter the API name: get_image_content. Optionally, you can populate the Description field. Leave the Endpoint Type set to Regional, and then click Create API.
In the Methods screen, click Actions and select Create Resource. Set the resource name to snowflake_proxy before clicking the “Create resource” button.
Next, click Actions and select Create Method. From the drop-down list, select POST, and click the tick mark. Leave Lambda Function selected as the integration type. Select the “Use Lambda Proxy integration” option. In the Lambda Function dialog box, enter parse_image and click Save. In the pop-up box, click OK. The screen now presents a schematic showing request/response. This can be ignored.
From the Actions drop-down, select Deploy API action. Set the Deployment Stage to [New Stage], set the stage name to “test”, and then click Deploy.
The newly created “test” stage can now be selected and the tree opens. Click POST and record the Invoke URL (in my example, https://8qdfb0w8fh.execute-api.eu-west-2.amazonaws.com/test/snowflake_proxy ). Record this in the tracking template. Note that yours will differ.
Secure API Gateway Proxy
Click Amazon API Gateway ➤ get_image_content ➤ POST ➤ Method Request. Next, click the Edit symbol in Authorization , and in the drop-down list, select AWS IAM. Click the tick mark to save. From the Actions drop-down list, select the Deploy API action. Set the deployment stage to “test” and click Deploy.
Then click Method Execution and retrieve the ARN; in my example, it is arn:aws:execute-api:eu-west-2:616701129608:8qdfb0w8fh/*/POST/snowflake_proxy. Record this in the tracking template. Note that yours will differ.
On the left of the screen, select Resource Policy and paste the example resource policy from the documentation at https://docs.snowflake.com/en/sql-reference/external-functions-creating-aws-ui-proxy-service.html#secure-your-amazon-api-gateway-endpoint .
Replace <12-digit-number> with your AWS account ID (616701129608 in our example). Replace <external_function_role> with Snowflake_External_Function. Then replace <method_request_ARN> with your method execution ARN.
Click the Save button before clicking get_image_content in the header bar. Click Actions ➤ Deploy API. In the deployment stage, select “test” and then click Deploy.
Snowflake API Integration
In this section, I have substituted values from my AWS account; yours will differ and must be replaced.
Create API Key
This section is included for completeness only. You may not need an API key, but if your organization requires API keys, here is how to create them. Documentation is at https://docs.snowflake.com/en/sql-reference/external-functions-security.html#using-the-api-key-option-in-create-api-integration .
From AWS Management Console, navigate to API Gateway ➤ get_image_content, and on the left-hand panel, select API Keys ➤ Actions ➤ Create API key.
In the Name dialog, enter get_image_content_key. Leave the API key set to Auto Generate, and optionally add a description before clicking Save.
To see the generated key, navigate to API Gateway ➤ get_image_content. On the left-hand panel, select API Keys ➤ get_image_content_key ➤ API key ➤ show. My key looks like this: uqkOKvci6OajRNCYXvohX9WTY8HpxGzt5vj9eDHV. Add this to the tracking template.
Navigate to API Gateway ➤ get_image_content, and on the left-hand panel, select Resources ➤ Method Request ➤ Settings ➤ API Key Required. Ensure it is set to “true”. If not, edit it, click the checkbox, and then choose Actions ➤ Deploy API.
AWS documentation is at https://docs.aws.amazon.com/apigateway/latest/developerguide/api-gateway-setup-api-key-with-console.html .
Create API Integration
The use of the AWS API key is not covered in the online video tutorial, hence the preceding section. The API key section is commented out in the following command, which serves as a placeholder. Your API key will differ from mine.
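A sketch of the command, populated with the values recorded in my tracking template (yours will differ), might look like this:

```sql
CREATE OR REPLACE API INTEGRATION get_image_content_integration
  API_PROVIDER         = aws_api_gateway
  API_AWS_ROLE_ARN     = 'arn:aws:iam::616701129608:role/Snowflake_External_Function'
  API_ALLOWED_PREFIXES = ('https://8qdfb0w8fh.execute-api.eu-west-2.amazonaws.com/test/snowflake_proxy')
  -- API_KEY = '...'  -- placeholder: uncomment only if your gateway requires an API key
  ENABLED              = TRUE;
```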
Record these settings in the tracking template. Yours will differ.
API_AWS_IAM_USER_ARN: arn:aws:iam::321333230101:user/vnki-s-ukst5070
API_AWS_EXTERNAL_ID: GH06274_SFCRole=3_8GebOvugHdY0QcaNVeO5Ki/H2+Q=
Establish Trust Relationship
Then click the “Update policy” button. If we later update the API integration, the trust relationship will also need updating.
Create Snowflake External Function
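A minimal sketch of the external function declaration follows, using the API integration and invoke URL recorded in the tracking template; yours will differ.

```sql
CREATE OR REPLACE EXTERNAL FUNCTION parse_image(file_url VARCHAR)
  RETURNS VARIANT
  API_INTEGRATION = get_image_content_integration
  AS 'https://8qdfb0w8fh.execute-api.eu-west-2.amazonaws.com/test/snowflake_proxy';

-- Quick connectivity check
SELECT parse_image('hello world');
```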
Calling External Function
Deploy Lambda Image
This section modifies parse_image to scrape content from an image file.
Assign IAM Policy
Click the “Review policy” button and then save the changes. You may be prompted to remove an old version of test_policy.
Next, attach the policy. Navigate to IAM ➤ Roles ➤ Snowflake_External_Function ➤ Add permissions ➤ Attach policies. You may also find a default role created for parse_image, named something like parse_image-role-eu6asv26, to which test_policy should be attached.
Set the checkbox for test_policy and then click the “Attach policies” button. Do not forget to check that the policy has been attached. Navigate to IAM ➤ Roles ➤ Snowflake_External_Function and check that test_policy appears under the “Permissions policies” section.
Invoke Lambda Image
There is a limitation with the GET_PRESIGNED_URL function insofar as the stage must be a fully qualified literal stage name declared as an object, not a quoted string passed through. For more information, see the documentation at https://docs.snowflake.com/en/sql-reference/functions/get_presigned_url.html#get-presigned-url.
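For example, an invocation might look like the following sketch; the stage and file names are hypothetical.

```sql
SELECT parse_image(
         -- the stage must be a literal, fully qualified stage object
         GET_PRESIGNED_URL(@my_stage, 'test_json.png')
       );
```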
The JSON document returns fields and values, but due to the way in which Textract operates, not every key/value pair is returned. A full explanation is beyond the scope of this chapter, but briefly, text in images is bound by boxes before extraction, and it is the content of each box that is returned. Examine the returned JSON for information. Note that trailing spaces are significant for matching specific fields.
Automation
Our next step is to automate document JSON extract and parsing, an exercise left for you to complete using your knowledge of streams and tasks.
Monitoring
In AWS Management Console, navigate to CloudWatch ➤ Log groups, where you see the /aws/lambda/parse_image log group. Click it to show the latest logs.
Note that upstream errors (i.e., those which do not cause Lambda to be invoked) are not recorded. Only Lambda invocations, whether successful or not, create log entries.
Troubleshooting
Inevitably there will be times when we need to fix our code, and during the creation of this example, I made several mistakes, hence my note to use the tracking template. One “gotcha” stands out: after changing the Method Request setting, I failed to redeploy via Actions ➤ Deploy API, an easy step to miss. After which, I encountered a further error relating to the API key; see https://docs.snowflake.com/en/sql-reference/external-functions-creating-aws-troubleshooting.html#request-failed-for-external-function-ext-func-with-remote-service-error-403-message-forbidden, to which the answer was to disable the API key in the Method Request.
The last error encountered related to Lambda not having an entitlement to access Textract, specifically analyze_document, as identified in the CloudWatch logs. The solution was to add a policy containing appropriate entitlement to the Lambda function.
The following document offers solutions to common issues: https://docs.snowflake.com/en/sql-reference/external-functions-creating-aws-troubleshooting.html#troubleshooting-external-functions-for-aws .
And for AWS-specific issues, see https://docs.snowflake.com/en/sql-reference/external-functions-creating-aws-troubleshooting.html#label-external-functions-creating-aws-troubleshooting .
Cleanup
Summary
This chapter introduced semi-structured data, focusing on JSON, and loaded a sample file into the Snowflake VARIANT data type. We discussed techniques to handle source files greater than the 16 MB maximum a VARIANT can accept.
Noting the self-contained and complex nature of semi-structured data, we utilized materialized views to normalize it into a more easily understood format, leading to natural integration with our relational data model. A summary of use cases for direct query versus materialized view creation was also provided. Materialized view limitations were discussed before touching upon automation of semi-structured data ingestion.
Moving on to unstructured data, we dived into unfamiliar territory, looking inside AWS Management Console, configuring an S3 event to automatically invoke a Python Lambda function, and copying and processing an image file into JSON format, thus demonstrating the capability to convert an unstructured image file into a semi-structured format.
Our next example delivered a “hello world” style Lambda function as an external function accessible via Snowflake, using SQL to prove end-to-end connectivity, which is not a trivial undertaking. We then adapted our working Lambda function to scrape content out of an image document, proving we can convert an image file into a JSON document, albeit with some limitations imposed by Amazon Textract, and then extract specific key/value pairs using SQL. I left automation as an exercise for you to implement using streams and tasks.
Having worked through semi-structured and unstructured data, we now look at the query optimizer and performance tuning.