© The Author(s), under exclusive license to APress Media, LLC, part of Springer Nature 2022
A. Carruthers, Building the Snowflake Data Cloud, https://doi.org/10.1007/978-1-4842-8593-0_9

9. Data Presentation

Andrew Carruthers, Birmingham, UK

This chapter follows directly from Chapter 8, where we built our sample data pipeline. I now introduce reasoning and logic before building out components to implement data-driven dynamic validations, then identify how to enrich data. We then dive into data masking, row-level security, and object tagging.

The whole purpose of Chapter 2 is to underpin everything we do in this book; there is no point in creating another data silo. However, we do not have space to cover everything in this chapter, so I must defer some related topics until later.

Figure 9-1 is becoming difficult to read but illustrates the reference architecture for validation, enrichment, and presentation components.
Figure 9-1

Conceptual data flow

I have referenced multiple schemas in diagrams to reinforce the message to segregate data for security and best practices. However, I have developed everything using the SYSADMIN role, which is not recommended. I leave it for you to segregate into the proper roles.

Reference Data

What do we mean by reference data? Why do we need it? How do we source it?

Reference data consists of commonly accepted, industry-standard definitions. Country names and currency codes are examples of reference data. They are typically static dimensions, although occasionally we see the birth of a new country or the adoption of a new currency.

We might think there is a single “golden source” or “system of record” in our organizations, but in fact, many applications maintain their own local reference data, and only a few create (or “master”) reference data for use in other departments. Mastering reference data means the application is the (hopefully) single authoritative source for creating reference data in a specific business domain. And very few organizations have the foresight to centralize their master reference data into a single source, like Snowflake, distributing reference data to all consumers.

There is a difference between externally sourced reference data and internal business-focused reference data. With externally sourced reference data, we should reference commonly accepted data sources such as the International Organization for Standardization at www.iso.org/home.html . In contrast, internal business-focused reference data may be mastered by different parts of our organization, such as finance mastering their chart of accounts or risk mastering data security classifications.

Regardless of where reference data is sourced, we must have the latest up-to-date version (SCD1) and history (SCD2) available for several reasons.
  • Utilizing the built-in Snowflake capabilities described in Chapter 3, we have simple data distribution capabilities at our fingertips.

  • We must also ensure data received into Snowflake conforms to our reference data and mark, but not exclude, failing records accordingly.

  • Where record enrichment is required, we rely upon accurate reference data to enhance our customer reporting experience.

Without common reference data, we are certain to encounter data mismatches. We also find people making manual adjustments to correct data in their local system but never feeding back to the golden source. The endless cycle of poor-quality data issues continually repeats.

The manual approach to correct data is labor-intensive, costly, and error-prone because humans make mistakes. Our objective must always be to simplify, automate and remove cottage industries wherever possible. Companies need their brightest and smartest people to add real value and not be tied up with endless administration and data corrections.

Having explained what reference data is, why we need it, and how it is sourced, let’s now look at practical use cases found in all organizations. I assume you have already created a reference schema and objects and loaded your reference data.
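For orientation, here is a minimal sketch of what such a reference schema and table might look like, assuming a hypothetical reference schema and an ISO country table named ref_country; your naming conventions and full SCD2 handling will differ.
CREATE SCHEMA IF NOT EXISTS TEST.reference;
CREATE TABLE IF NOT EXISTS TEST.reference.ref_country
(
iso_alpha_3   VARCHAR(3)    NOT NULL,
country_name  VARCHAR(255)  NOT NULL,
valid_from    TIMESTAMP_NTZ,
valid_to      TIMESTAMP_NTZ,
current_flag  VARCHAR(1)
);
INSERT INTO TEST.reference.ref_country VALUES ( 'USA', 'United States of America', current_timestamp(), NULL, 'Y' );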

Validation

At the simplest level, we define validation as checking against reference data to ensure conformance. Equally, checks can be applied within a data set, for example, to ensure an employee is not their own manager.

We need to keep things simple, but as you might imagine, with software development, complexity rears its head at every opportunity, and designing generic processes which can be applied across many feeds is non-trivial.

Validations do not exclude data from flowing through our data pipeline. They simply mark records and attributes as failing validation rules, which may be considered data quality rules. Over time, we collate counts of records failing the applied data quality rules, leading to metrics such as confidence scores and the identification of common or repeat offenders.

Figure 9-2 illustrates what we are building.
Figure 9-2

Validation overview

Data lands in our SCD2 table, against which a stream has been declared to detect the new data, enabling the task to begin executing the wrapper stored procedure. We expect to apply several validation routines expressed as JavaScript stored procedures, each implementing a business rule and invoked serially by the wrapper stored procedure. Data quality exceptions are saved to a table with a view on top, making the data available for extraction and feedback to the source. Naturally, we cannot be prescriptive in how data quality exceptions are propagated, but remember the core principle of not allowing corrections in our system; it is for each organization to determine its feedback pattern.

In our example, we hard-code a few rules, but later you see how to dynamically generate rules on the fly.

We reuse code from Chapter 8. Assume our storage integration is configured along with dependent objects to populate scd2_content_test, including the task_load_test_data task.

First, let’s establish our baseline.
USE ROLE      sysadmin;
USE DATABASE  TEST;
USE WAREHOUSE COMPUTE_WH;
USE SCHEMA    public;
CREATE OR REPLACE STAGE TEST.public.test_stage
STORAGE_INTEGRATION = test_integration
DIRECTORY           = ( ENABLE = TRUE AUTO_REFRESH = TRUE )
ENCRYPTION          = ( TYPE = 'SNOWFLAKE_SSE' )
URL                 = 's3://btsdc-test-bucket/'
FILE_FORMAT         = TEST.public.test_pipe_format;
CREATE OR REPLACE STREAM strm_test_stage ON STAGE TEST.public.test_stage;
CREATE OR REPLACE VIEW v_strm_test_stage COPY GRANTS
AS
SELECT '@TEST.public.test_stage/'||relative_path                              AS path_to_file,
       SUBSTR ( relative_path, 1, REGEXP_INSTR ( relative_path, '_20' ) - 1 ) AS table_name,
       size,
       last_modified,
       metadata$action
FROM   strm_test_stage;
ALTER STAGE TEST.public.test_stage REFRESH;
CREATE OR REPLACE TABLE stg_content_test
(
id            NUMBER,
content       VARCHAR(30),
last_updated  TIMESTAMP_NTZ DEFAULT current_timestamp()::TIMESTAMP_NTZ NOT NULL
);
CREATE OR REPLACE TABLE scd1_content_test
(
id            NUMBER,
content       VARCHAR(30),
last_updated  TIMESTAMP_NTZ DEFAULT current_timestamp()::TIMESTAMP_NTZ NOT NULL
);
CREATE OR REPLACE STREAM strm_scd1_content_test ON TABLE scd1_content_test;
CREATE OR REPLACE TABLE scd2_content_test
(
id            NUMBER,
content       VARCHAR(30),
valid_from    TIMESTAMP_NTZ,
valid_to      TIMESTAMP_NTZ,
current_flag  VARCHAR(1),
decision      VARCHAR(100)
);

I assume JavaScript sp_stg_to_scd1 and sp_load_test_data stored procedures, v_content_test, and task_load_test_data exist, with task task_load_test_data suspended.

Upload Test Files into S3

Create a target table to store data quality exceptions generated by our validation routines.
CREATE OR REPLACE TABLE data_quality_exception
(
data_quality_exception_id       NUMBER        NOT NULL,
validation_routine              VARCHAR(255)  NOT NULL,
data_quality_exception_code_id  NUMBER        NOT NULL,
source_object_name              VARCHAR(255)  NOT NULL,
source_attribute_name           VARCHAR(255)  NOT NULL,
source_record_pk_info           VARCHAR(255)  NOT NULL,
insert_timestamp                TIMESTAMP_NTZ DEFAULT current_timestamp()::TIMESTAMP_NTZ NOT NULL
);
Create a local reference table holding the list of exception codes. We would usually declare referential integrity too, even though Snowflake does not enforce referential integrity constraints; a sketch of the declarations follows the table definition.
CREATE OR REPLACE TABLE data_quality_exception_code
(
data_quality_exception_code_id          NUMBER        NOT NULL,
data_quality_exception_code_name        VARCHAR(255)  NOT NULL,
data_quality_exception_code_description VARCHAR(255)  NOT NULL,
insert_timestamp                        TIMESTAMP_NTZ DEFAULT current_timestamp()::TIMESTAMP_NTZ NOT NULL
);
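The following sketch shows how the relationship between the two tables could be declared; the constraint names are illustrative, and because Snowflake does not enforce these constraints on standard tables, they serve documentation and tooling purposes only.
ALTER TABLE data_quality_exception_code
ADD CONSTRAINT pk_data_quality_exception_code
PRIMARY KEY ( data_quality_exception_code_id );
ALTER TABLE data_quality_exception
ADD CONSTRAINT fk_data_quality_exception_code
FOREIGN KEY ( data_quality_exception_code_id )
REFERENCES data_quality_exception_code ( data_quality_exception_code_id );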
Create sequences to uniquely identify records.
CREATE OR REPLACE SEQUENCE seq_data_quality_exception_id      START WITH 10000;
CREATE OR REPLACE SEQUENCE seq_data_quality_exception_code_id START WITH 10000;
Create a data_quality_exception_code record.
INSERT INTO data_quality_exception_code  VALUES (seq_data_quality_exception_code_id.NEXTVAL, 'ATTRIBUTE_IS_NULL', 'Attribute is declared as NOT NULL but NULL value found', current_timestamp());
Create a simple validation stored procedure sp_is_attribute_null, noting the embedded use of current_flag to select the latest records.
CREATE OR REPLACE PROCEDURE sp_is_attribute_null( P_ROUTINE          STRING,
                                                  P_DQ_EXEP_CODE     STRING,
                                                  P_SOURCE_OBJECT    STRING,
                                                  P_SOURCE_ATTRIBUTE STRING,
                                                  P_SOURCE_PK_INFO   STRING ) RETURNS STRING
LANGUAGE javascript
EXECUTE AS CALLER
AS
$$
   var sql_stmt  = "";
   var stmt      = "";
   var result    = "";
   sql_stmt  = "INSERT INTO data_quality_exception ( data_quality_exception_id, validation_routine, data_quality_exception_code_id, source_object_name, source_attribute_name, source_record_pk_info, insert_timestamp ) "
   sql_stmt += "SELECT seq_data_quality_exception_id.NEXTVAL, "
   sql_stmt += "       :1, "
   sql_stmt += "       ( SELECT data_quality_exception_code_id FROM data_quality_exception_code WHERE data_quality_exception_code_name = :2 ), "
   sql_stmt += "       :3, "
   sql_stmt += "       :4, "
   sql_stmt += "       :5, "
   sql_stmt += "       current_timestamp()::TIMESTAMP_NTZ "
   sql_stmt += "FROM   " + P_SOURCE_OBJECT    + " "
   sql_stmt += "WHERE  " + P_SOURCE_ATTRIBUTE + " IS NULL "
   sql_stmt += "AND    current_flag = 'Y'; ";
   stmt = snowflake.createStatement ({ sqlText:sql_stmt, binds:[P_ROUTINE, P_DQ_EXEP_CODE, P_SOURCE_OBJECT, P_SOURCE_ATTRIBUTE, P_SOURCE_PK_INFO] });
   try
   {
       result = stmt.execute();
       result = "Number of rows found: " + stmt.getNumRowsAffected();
   }
   catch (err) { result = sql_stmt; }  // return the failing SQL for debugging
   return result;
$$;
Recognizing we have not loaded any data into our tables, run a test. The result should be 'Number of rows found: 0'.
CALL sp_is_attribute_null ( 'Test',
                            'ATTRIBUTE_IS_NULL',
                            'scd2_content_test',
                            'content',
                            'id');
We should now test for all possible scenarios, so load data into scd2_content_test, then update the content attribute to NULL and re-run the test. If we have set our test up correctly, we can expect records in data_quality_exception.
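If you would rather not re-run the whole pipeline, a minimal sketch that seeds a single failing record directly follows; the values are illustrative only.
INSERT INTO scd2_content_test ( id, content, valid_from, valid_to, current_flag, decision )
VALUES ( 1, NULL, current_timestamp(), NULL, 'Y', 'Deliberate NULL content for testing' );
CALL sp_is_attribute_null ( 'Test',
                            'ATTRIBUTE_IS_NULL',
                            'scd2_content_test',
                            'content',
                            'id');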
SELECT * FROM data_quality_exception;
Before re-running our test, we must remove previous records from data_quality_exception; always reset the baseline before re-testing and work methodically. There are no shortcuts in software development. Part of the learning is to approach our subject in a logical, consistent, repeatable manner. We need this discipline when developing a continuous integration framework, and for the curious, these repeat steps become useful as plug-in test cases later.
TRUNCATE TABLE data_quality_exception;
When we are confident our validation routine works as expected, we can create a new stored procedure, sp_validate_test_data, to wrap the validation procedures. Note that there are two calls. You may wish to add control logic to ensure the first stored procedure execution outcome is properly returned in the event of failure.
CREATE OR REPLACE PROCEDURE sp_validate_test_data() RETURNS STRING
LANGUAGE javascript
EXECUTE AS CALLER
AS
$$
   var stmt      = "";
   var result    = "";
   stmt = snowflake.createStatement ({ sqlText: "CALL sp_is_attribute_null(?,?,?,?,?);",
                                       binds:['Test', 'ATTRIBUTE_IS_NULL', 'scd2_content_test', 'id', 'id'] });
   try
   {
       result = stmt.execute();
       result = "SUCCESS";
   }
   catch (err) { result = "FAILED: " + err.message; }
   stmt = snowflake.createStatement ({ sqlText: "CALL sp_is_attribute_null(?,?,?,?,?);",
                                       binds:['Test', 'ATTRIBUTE_IS_NULL', 'scd2_content_test', 'content', 'id'] });
   try
   {
       result = stmt.execute();
       result = "SUCCESS";
   }
   catch (err) { result = "FAILED: " + err.message; }
   return result;
$$;
And test our wrapper.
CALL sp_validate_test_data();
Remember to grant entitlement to execute tasks to sysadmin.
USE ROLE accountadmin;
GRANT EXECUTE TASK ON ACCOUNT TO ROLE sysadmin;
And now, create a task triggered by the presence of new data in strm_scd2_content_test, calling our new sp_validate_test_data stored procedure.
USE ROLE sysadmin;
Create a stream, strm_scd2_content_test, on table scd2_content_test; we use this to trigger our validation wrapper task.
CREATE OR REPLACE STREAM strm_scd2_content_test ON TABLE scd2_content_test;
Create a new task to run validations.
CREATE OR REPLACE TASK task_validate_test_data
WAREHOUSE = COMPUTE_WH
SCHEDULE  = '1 minute'
WHEN system$stream_has_data ( 'strm_scd2_content_test' )
AS
CALL sp_validate_test_data();
Set task to run.
ALTER TASK task_validate_test_data RESUME;
Check task status.
SHOW tasks;
Identify next scheduled run time.
SELECT timestampdiff ( second, current_timestamp, scheduled_time ) as next_run,
       scheduled_time,
       current_timestamp,
       name,
       state
FROM   TABLE ( information_schema.task_history())
WHERE  state = 'SCHEDULED'
ORDER BY completed_time DESC;
By now, you should be comfortable with testing the data pipeline from resetting the test case baseline, loading files from S3, ingesting them into the SCD2 table, and running validations. Once your testing is complete, suspend the task.
ALTER TASK task_load_test_data SUSPEND;

While a simple example has been provided, more complex validations should be considered. Examples include lookups to reference data to ensure attributes match known, approved values and internal data consistency checks, such as ensuring an employee is not their own manager. The possibilities are almost endless, and the more complete our data validations are, the better.
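For illustration, a reference-data lookup can be expressed as plain SQL before being wrapped in a parameterized stored procedure like sp_is_attribute_null. The sketch below assumes a hypothetical country_code attribute on the source object, the ref_country table sketched earlier, and a hypothetical ATTRIBUTE_NOT_IN_REFERENCE exception code.
INSERT INTO data_quality_exception
SELECT seq_data_quality_exception_id.NEXTVAL,
       'Test',
       ( SELECT data_quality_exception_code_id
         FROM   data_quality_exception_code
         WHERE  data_quality_exception_code_name = 'ATTRIBUTE_NOT_IN_REFERENCE' ),  -- hypothetical exception code
       'scd2_content_test',
       'country_code',                                                              -- hypothetical attribute
       'id',
       current_timestamp()::TIMESTAMP_NTZ
FROM   scd2_content_test src
WHERE  src.current_flag = 'Y'
AND    NOT EXISTS ( SELECT 1
                    FROM   TEST.reference.ref_country ref
                    WHERE  ref.iso_alpha_3 = src.country_code );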

Having worked through examples of how to both identify and record data quality exceptions, let’s first consider what to do with those exceptions and then turn our attention to enriching data.

Data Quality Exceptions

Having run our validation routines, the resulting data quality exceptions generate metrics we can use for several purposes. We can determine the quality of data in our source systems, provide a measure of data quality improvements over time, and give consumers confidence in our data’s fidelity.

Once we have generated data quality exceptions, we must decide what to do with the data. Most organizations do not have a clearly defined feedback route to the source. Still, the generation of data quality exceptions exposes gaps or issues with the supplied data. From previous discussions, you know correction in a data warehouse is the wrong action to allow.

Ideally, we would automatically feed back data to the source for correction, leading to a virtuous circle where the number of data quality exceptions reduces to near zero over time. An initial objective should be to reduce the time taken to correct data between submission and validation cycles.

Options for surfacing data quality exceptions include provisioning database views and associated reporting screens, automated email notifications, and periodic extracts to flat files in a shared directory. You may have your own preferred option; a sketch of a summary view follows.
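As one example of the database view option, here is a minimal sketch of a summary view over data_quality_exception; the view name and grouping are illustrative only.
CREATE OR REPLACE VIEW v_data_quality_exception_summary
AS
SELECT dqe.source_object_name,
       dqec.data_quality_exception_code_name,
       COUNT(*)                     AS exception_count,
       MIN ( dqe.insert_timestamp ) AS first_seen,
       MAX ( dqe.insert_timestamp ) AS last_seen
FROM   data_quality_exception      dqe,
       data_quality_exception_code dqec
WHERE  dqe.data_quality_exception_code_id = dqec.data_quality_exception_code_id
GROUP BY dqe.source_object_name,
         dqec.data_quality_exception_code_name;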

I do not propose a means to manage, accept, or clear data quality exceptions for our data warehouse. The operational management is left for you to determine.

Enrichment

Data enrichment is the process by which we add attributes into a data set to make the data more easily understood, business-focused, and useful to the end consumer. Here we focus on the data itself and not the metadata definitions, which are addressed later in this chapter with object tagging. Figure 9-3 introduces enrichment views into our data pipeline.
Figure 9-3

Enrichment overview

Often our business colleagues have poorly articulated expectations and make assumptions about the artifacts we deliver to them. As software developers, we must always consider the business outcomes of our work. Everything we do has a cost and should (but does not always) deliver tangible business benefits.

A practical example of data enrichment is to expand an ISO country code from the three-letter alpha code (e.g., USA) to the full country name (e.g., United States of America), along with any other attributes required by our users or that we proactively think they will find useful. See https://en.wikipedia.org/wiki/List_of_ISO_3166_country_codes for further information.

There is nothing particularly clever about this approach; every reporting data set joins to reference data, hence we are not diving into any particular design pattern or code sample. Instead, we reiterate the point of not excluding records by using INNER JOINs. We prefer OUTER JOINs with default values for missing attributes, upholding the principle of preserving all records throughout the data lifecycle. Our operational support teams have better things to do than chase a single missing record.
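A minimal sketch of the preferred pattern follows, reusing the ref_country sketch from earlier and assuming a hypothetical country_code attribute on the source; NVL supplies a default where the lookup fails rather than dropping the row.
CREATE OR REPLACE VIEW v_enriched_content_test
AS
SELECT src.id,
       src.content,
       NVL ( ref.country_name, 'Unknown' ) AS country_name   -- default value preserves the record
FROM   scd2_content_test src
LEFT   OUTER JOIN TEST.reference.ref_country ref
ON     ref.iso_alpha_3 = src.country_code                    -- hypothetical attribute
WHERE  src.current_flag = 'Y';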

Dynamic Data Masking

In our data sets, we find a wide variety of content, some of which is highly sensitive for various reasons. We may also have a requirement to mask production data for use in lower environments such as user acceptance testing or development where the use of production data is prohibited. Although cloning databases is (usually) trivial in Snowflake, to do so when generating a development environment may expose data. Snowflake documentation introduces the subject of data masking at https://docs.snowflake.com/en/user-guide/security-column-ddm-intro.html .

Data masking may also be applied to external tables, but not at the point of creation, only by altering the table afterward, as the metadata required does not exist until after creation. Further restrictions on masking policy usage apply. Please refer to the documentation at https://docs.snowflake.com/en/user-guide/security-column-intro.html#label-security-column-intro-cond-cols .

Dynamic data masking determines what the user sees and should be considered a filter on the attribute values displayed. In other words, data masking does not filter the rows returned by a query, only the visible content displayed.

Note

Apply data masking selectively. There is a performance impact where it has been excessively implemented.

Simple Masking

This section dives into implementing data masking with a hands-on demonstration. First, we must establish a maskingadmin administrative role to administer our masking policies.
USE ROLE securityadmin;
CREATE OR REPLACE ROLE maskingadmin;
GRANT CREATE MASKING POLICY ON SCHEMA TEST.public TO ROLE maskingadmin;
Grant entitlement to use the database, warehouse, and schema in which we allow masking policies to be created. In this example, we use the TEST database and PUBLIC schema.
GRANT USAGE ON DATABASE  TEST        TO ROLE maskingadmin;
GRANT USAGE ON WAREHOUSE compute_wh  TO ROLE maskingadmin;
GRANT USAGE ON SCHEMA    TEST.public TO ROLE maskingadmin;
Grant maskingadmin to your user.
GRANT ROLE maskingadmin TO USER <YOUR_USER>;
Because APPLY MASKING POLICY is granted at the account level, we must use the ACCOUNTADMIN role.
USE ROLE accountadmin;
GRANT APPLY MASKING POLICY ON ACCOUNT TO ROLE maskingadmin;
Set context.
USE ROLE      sysadmin;
USE DATABASE  TEST;
USE WAREHOUSE COMPUTE_WH;
USE SCHEMA    public;
Create the masking_test table.
CREATE OR REPLACE TABLE masking_test
(
user_email        VARCHAR(30)  NOT NULL,
user_email_status VARCHAR(30)  NOT NULL
);
Create a stream on the masking_test table.
CREATE OR REPLACE STREAM strm_masking_test ON TABLE masking_test;
Create two rows in the masking_test table.
INSERT INTO masking_test
VALUES
('user_1@masking_test.com', 'Public' ),
('user_2@masking_test.com', 'Private');
Prove we have two rows in our masking_test table.
SELECT * FROM masking_test;
Create a view on the masking_test table.
CREATE OR REPLACE VIEW TEST.public.v_masking_test
AS
SELECT * FROM masking_test;
Grant entitlement to maskingadmin on masking_test table and v_masking_test view.
GRANT SELECT ON masking_test      TO ROLE maskingadmin;
GRANT SELECT ON v_masking_test    TO ROLE maskingadmin;
GRANT SELECT ON strm_masking_test TO ROLE maskingadmin;
At this point no masking policies exist.
SHOW masking policies;
SHOW masking policies IN ACCOUNT;
Switch to the maskingadmin role and set schema context.
USE ROLE      maskingadmin;
USE DATABASE  TEST;
USE WAREHOUSE COMPUTE_WH;
USE SCHEMA    TEST.public;
Prove we can see unmasked data in masking_test.
SELECT * FROM masking_test;
Figure 9-4 shows the expected output from our query.
Figure 9-4

Unmasked raw data

This step is only required if the whole test case has been re-run from the start, and it requires some explanation. If the table is re-created after the masking policy has been applied to an attribute, then it is not possible to re-create the masking policy. The behavior is inconsistent with expectations, and the next few lines are a workaround.
USE ROLE accountadmin;
DROP MASKING POLICY IF EXISTS dq_code_mask;
Revert to the maskingadmin role.
USE ROLE maskingadmin;
Create a masking policy.
CREATE OR REPLACE MASKING POLICY dq_code_mask AS ( P_PARAM STRING ) RETURNS STRING ->
CASE
   WHEN current_role() IN ('SYSADMIN') THEN P_PARAM
   ELSE '*********'
END;
Check that the masking policy has been created.
SHOW masking policies LIKE 'dq_code_mask';
Apply masking policy to an attribute on the masking_test table.
ALTER TABLE TEST.public.masking_test
MODIFY COLUMN user_email SET MASKING POLICY dq_code_mask;
Check masking_test.user_email is now displayed as '*********'.
SELECT * FROM masking_test;
Figure 9-5 shows the expected output from our query.
Figure 9-5

Masked user_email

This simple demonstration shows the effect of adding a masking policy to a single table. But for our data pipelines, we also need to cascade the masking policy to other objects in our pipeline. Let’s see what happens when we create a view on top of our table.
SELECT * FROM v_masking_test;

The result set is identical to Figure 9-5.

And now a stream.
SELECT * FROM strm_masking_test;
Figure 9-6 shows the expected output from our query.
Figure 9-6

Masked stream output

For both the view and stream, the data remains masked.

Change role to SYSADMIN and ensure masking_test.user_email is now displayed unmasked.
USE ROLE sysadmin;
SELECT * FROM masking_test;
Figure 9-7 shows the expected output from our query.
Figure 9-7

Unmasked raw data

Reset role to maskingadmin.
USE ROLE maskingadmin;
Unset the masking policy for masking_test.user_email.
ALTER TABLE TEST.public.masking_test
MODIFY COLUMN user_email UNSET MASKING POLICY;
And drop the masking policy.
DROP MASKING POLICY dq_code_mask;

Conditional Masking

So far, our test case has used the user’s current role to determine whether an attribute should be masked or not. A more sophisticated approach is to extend the masking policy to refer to a related attribute, as explained at https://docs.snowflake.com/en/user-guide/security-column-intro.html#label-security-column-intro-cond-cols .

In simple terms, we can now apply data masking by either role or attribute value, reusing our test case. We now rely upon the value of attribute user_email_status to determine masking outcome.

Note the masking policy prototype has been extended to two parameters with data types.
CREATE OR REPLACE MASKING POLICY dq_code_mask AS ( user_email VARCHAR, user_email_status VARCHAR ) RETURNS STRING ->
CASE
   WHEN current_role() IN ('SYSADMIN') THEN user_email
   WHEN user_email_status = 'Public'   THEN user_email
   ELSE '*********'
END;
Likewise, setting the masking policy for an attribute has also been extended.
ALTER TABLE TEST.public.masking_test
MODIFY COLUMN user_email
SET MASKING POLICY dq_code_mask
USING (user_email, user_email_status);
The remainder of our test cases are identical to the preceding one.
SELECT * FROM masking_test;
Figure 9-8 shows the expected output from our query with masking applied according to the user_email_status.
Figure 9-8

Conditional data masking

SELECT * FROM v_masking_test;

The result set is identical to Figure 9-8.

And now a stream.
SELECT * FROM strm_masking_test;
Figure 9-9 shows the expected output from our query.
Figure 9-9

Masked stream output

For both view and stream, the data remains masked.

Revert to the sysadmin role and re-test.
USE ROLE sysadmin;
SELECT * FROM masking_test;
Figure 9-10 shows the expected output from our query.
Figure 9-10

Unmasked raw data

Cleanup also remains the same.
USE ROLE maskingadmin;
ALTER TABLE TEST.public.masking_test
MODIFY COLUMN user_email UNSET MASKING POLICY;
DROP MASKING POLICY dq_code_mask;

Data Masking Considerations

With a simplistic test case, data masking is readily demonstrated, but we are in the business of building data pipelines and must consider how roles and data masking interact. Let’s assume our cybersecurity colleagues impose a policy that states: all confidential attributes must be masked from the point of ingestion into Snowflake, through all internal processing, until the ultimate point of consumption, where confidential attributes must be unmasked. In other words, the security policy states that specific attributes persisted in Snowflake must have data masking applied and only be visible after leaving Snowflake.

The security policy would be easily achieved if we had a single role handling every step from data ingestion to data consumption; a single masking policy with a single role would suffice. But our data lives in a highly secure environment, with multiple actors with specific roles operating on our data throughout its lifecycle. It is incumbent upon us to ensure security policies are enforced. Figure 9-11 articulates a real-world scenario where roles and interactions pose challenges in preserving data masking as data transits our pipeline. Let’s categorize our roles as follows.
  • Ingestion: The role that lands data into our staging table must not have entitlement to read the application table or presentation view.

  • Persistence: The role that moves data through our pipeline from staging table to presentation view must not have entitlement to read the presentation view.

  • Consumption: The role that presents data to the consuming service must not have entitlement to read either the staging table or application table.

Figure 9-11

Real-world data masking

The challenge with implementing data masking relates to the context required to allow each actor to read data unmasked while saving it into tables in the data pipeline where masking policies are applied. The answer is to carefully consider the masking policy declaration and only include the minimum required context allowing clear text to be returned.

Masking policies are agnostic and can be applied to any attribute. The context contained in the masking policy determines what the invoker sees. Therefore a few masking policies can be declared and applied ubiquitously across our data model according to the boundary at which they sit and the objects they protect.
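A minimal sketch of such a policy follows, assuming hypothetical pipeline role names ingest_role and transform_role; only the roles that must read clear text to move data forward through the pipeline appear in the list, and every other role sees the masked value.
CREATE OR REPLACE MASKING POLICY pipeline_email_mask AS ( P_PARAM STRING ) RETURNS STRING ->
CASE
   WHEN current_role() IN ('INGEST_ROLE', 'TRANSFORM_ROLE') THEN P_PARAM   -- hypothetical pipeline roles
   ELSE '*********'
END;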

Making changes to existing masking policies comes at a price. The masking policy after redefinition must then be re-tested, and the application testing process is both time-consuming and expensive.

Another use for data masking is to obfuscate production data rendering the content suitable for use in a testing environment. If the rendered values are acceptable and do not remove required data, then data masking policies may work very well.

The key to success with data masking is to apply masking policies selectively. Every SELECT statement invokes the masking policy applied to chosen attributes, therefore incurring additional processing overhead to unmask or partially mask attributes. It is recommended that only those attributes determined to be highly confidential, highly restricted, or commercially sensitive be masked. Do not forget to conduct testing with real-world scenarios to ensure masking is applied correctly, then load test using production-like data volumes.

Advanced Topics

External tokenization is an extension to data masking where data can be tokenized before loading into Snowflake. As tokenization relies upon third-party tooling, we do not discuss it further. Instead, refer to the documentation at https://docs.snowflake.com/en/user-guide/security-column-ext-token.html .

In our test cases, we relied upon current_role() to determine the execution context. A broader range of context functions available for use in masking policies is explained at https://docs.snowflake.com/en/user-guide/security-column-advanced.html#advanced-column-level-security-topics .
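For example, is_role_in_session() considers the role hierarchy rather than matching only the exact current role, which can simplify policies where entitled roles are granted to other roles; a minimal sketch follows, with the policy name and SYSADMIN purely illustrative. Check the linked documentation for availability and restrictions.
CREATE OR REPLACE MASKING POLICY hierarchy_email_mask AS ( P_PARAM STRING ) RETURNS STRING ->
CASE
   WHEN is_role_in_session ( 'SYSADMIN' ) THEN P_PARAM   -- true for any role in the session hierarchy
   ELSE '*********'
END;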

Row-Level Security (RLS)

While data masking determines what the user sees in terms of the attribute content displayed, row-level security (RLS) applies filters that determine the actual rows returned based upon a role. Snowflake refers to this capability as row access policies. Documentation is at https://docs.snowflake.com/en/user-guide/security-row.html .

From Chapter 5, you know object level entitlement is granted to roles, and with any given role, if we have entitlement to SELECT from an object, then we have access to all the data in the table.

In contrast, RLS applies filters to the object content, restricting returned data to only those rows matching the RLS definition. Figure 9-12 illustrates the difference in approach between data masking and RLS. The actual data is of no real interest, but for the curious, it is derived from lathe tool cutting speeds for different materials. The effects are what we need to know.
Figure 9-12

Data masking and row-level security

Having explained RLS and its impact, let’s look at a simple worked example. First, we must either create a new role or entitle an existing one; for simplicity, we reuse maskingadmin as the capabilities of RLS are compatible with data masking.

USE ROLE accountadmin;
GRANT APPLY ROW ACCESS POLICY ON ACCOUNT TO ROLE maskingadmin;

We rely upon previous entitlement grants to maskingadmin hereafter and expect masking_test table to exist.

Switch back to the sysadmin role and create a row access policy.
USE ROLE      sysadmin;
USE DATABASE  TEST;
USE WAREHOUSE COMPUTE_WH;
USE SCHEMA    public;
CREATE OR REPLACE ROW ACCESS POLICY sysadmin_policy
AS ( user_email VARCHAR ) RETURNS BOOLEAN ->
'SYSADMIN' = current_role();
Now apply the row access policy to masking_test.user_email.
ALTER TABLE TEST.public.masking_test
ADD ROW ACCESS POLICY sysadmin_policy ON ( user_email );
You should now see the row access policy.
SELECT *
FROM   TABLE ( information_schema.policy_references ( policy_name => 'sysadmin_policy' ));
Check that you can see expected rows.
SELECT * FROM masking_test;
Change roles and re-check. You should see no rows returned and no error, indicating the query ran but the rows were filtered out.
USE ROLE      maskingadmin;
SELECT * FROM masking_test;
Change back.
USE ROLE      sysadmin;
Before a row access policy can be dropped, we must remove it from all assigned tables.
ALTER TABLE TEST.public.masking_test DROP ROW ACCESS POLICY sysadmin_policy;
DROP ROW ACCESS POLICY sysadmin_policy;

What did we do? We created a simple row access policy that prevents access to data except when SYSADMIN is the current role, even if another role has SELECT access to the table. And yes, this is a trivial example used to illustrate how row access policy statements are used and to provide a template for extension into more useful patterns, as described at https://docs.snowflake.com/en/user-guide/security-row-using.html .
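A minimal sketch of the mapping-table pattern follows; the table, policy, and role names are illustrative. The policy consults a lookup table at query time, so access rules can change by maintaining mapping rows rather than redefining the policy. Remember to drop the policy from the table when done, as shown earlier.
CREATE OR REPLACE TABLE row_access_mapping
(
role_name         VARCHAR(255) NOT NULL,
user_email_status VARCHAR(30)  NOT NULL
);
INSERT INTO row_access_mapping VALUES ( 'MASKINGADMIN', 'Public' );
CREATE OR REPLACE ROW ACCESS POLICY status_policy
AS ( status VARCHAR ) RETURNS BOOLEAN ->
'SYSADMIN' = current_role()
OR EXISTS ( SELECT 1
            FROM   row_access_mapping m
            WHERE  m.role_name         = current_role()
            AND    m.user_email_status = status );
ALTER TABLE TEST.public.masking_test
ADD ROW ACCESS POLICY status_policy ON ( user_email_status );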

Object Tagging

This section addresses a complex subject with far-reaching consequences, necessitating explanation. As you are now aware, our data must be protected as it transits our systems. I have showcased both design techniques and implementation patterns adhering to best practices.

But at some point, we need a way of categorizing our data by object and attribute, making data discovery (relatively) easy. We can implement data discovery via external tooling such as Collibra, exposing the data model and metadata. This approach remains valid while the Collibra catalog and Snowflake information schemas are aligned.

Another way to enable data discovery is to tag each object and attribute with one or more tags. Supporting documentation is at https://docs.snowflake.com/en/user-guide/object-tagging.html .

Tag Uniqueness

We assume that organizations want to use an external tool such as Collibra to implement object tagging, and some general rules must be applied to ensure consistency across our estate.

Object tags are not a feature unique to Snowflake. However, their value to an organization is only realized if tags are unique across the organization, regardless of the location and system where they are applied. Without enforced uniqueness, duplicate tags become likely, enabling data to be misidentified and potentially disclosed inappropriately.

It is strongly recommended that tag creation and maintenance be centralized into a single team whose primary role is the maintenance of a unique organization tag taxonomy, the golden source of all tags used in the organization. Please be aware of the implications of changing or reorganizing tags when managed externally to Snowflake. The same changes must be reflected in our data warehouse, and lineage must be preserved.

Note

Snowflake limits the number of tags in an account to 10,000 and limits the number of allowed values for a single tag to 20.

Tag Taxonomy

We might decide to define our taxonomies according to the line of business, function, or capability, with a root node from which all other usages down to the leaf are represented using a delimited naming convention. By way of example, let’s look at privacy data, as this domain has a small set of tags. The tags presented here are for demonstration purposes only and are derived from www.investopedia.com/terms/p/personally-identifiable-information-pii.asp . Feel free to reuse them, but check that the definitions match your organization’s requirements.

Our taxonomy begins with the top-level PII (Personally Identifiable Information) with two subcategories for sensitive and non-sensitive attributes. For each of the two subcategories, we select a few identifiers; the full lists are more extensive, but we need only a representative sample.

For sensitive information, we use full name and Social Security number; for non-sensitive information, we use gender and date of birth. Our taxonomy may look like the following. Underscores are used to separate fields.
  • PII

  • PII_S_FullName

  • PII_S_SSN

  • PII_N_Gender

  • PII_N_DoB

To make the tags concise, we may use abbreviations. Naturally, we would use a short name and full description along with SCD2 historization when entered in our cataloging tool for management purposes.

Single Tag Value Implementation

With our tags identified, we can start to build out our sample implementation, creating a new role for object tagging: tagadmin. We assume familiarity with the following code; for further information, refer to the Snowflake documentation at https://docs.snowflake.com/en/user-guide/object-tagging.html#object-tagging .
USE ROLE securityadmin;
CREATE OR REPLACE ROLE tagadmin;
GRANT USAGE ON DATABASE  TEST        TO ROLE tagadmin;
GRANT USAGE ON WAREHOUSE compute_wh  TO ROLE tagadmin;
GRANT USAGE ON SCHEMA    TEST.public TO ROLE tagadmin;
Tags are created within a schema.
GRANT CREATE TAG ON SCHEMA TEST.public TO ROLE tagadmin;
Assign to your user.
GRANT ROLE tagadmin TO USER <YOUR_USER>;
Entitle tagadmin to apply tags to the account.
USE ROLE accountadmin;
GRANT APPLY TAG ON ACCOUNT TO ROLE tagadmin;

We must set the context before attempting to work with object tags.

Create a test table and load sample data.
USE ROLE      sysadmin;
CREATE OR REPLACE TABLE pii_test
(
id                     NUMBER,
full_name              VARCHAR(255),
social_security_number VARCHAR(255),
gender                 VARCHAR(255),
date_of_birth          TIMESTAMP_NTZ
);
CREATE OR REPLACE SEQUENCE seq_pii_test_id      START WITH 10000;
INSERT INTO pii_test VALUES (seq_pii_test_id.NEXTVAL, 'John Doe', '12345678', 'Male', current_timestamp()), (seq_pii_test_id.NEXTVAL, 'Jane Doe', '23456789', 'Female', current_timestamp());
See documentation for allowable entitlement options at https://docs.snowflake.com/en/user-guide/security-access-control-privileges.html#tag-privileges .
GRANT SELECT ON pii_test TO ROLE tagadmin;
Switch to the tagadmin role, noting we have segregated the object creation and data from the tagging. This is another example of separating capabilities via RBAC.
USE ROLE      tagadmin;
USE DATABASE  TEST;
USE WAREHOUSE COMPUTE_WH;
USE SCHEMA    public;
Create sample tags. The documentation is at https://docs.snowflake.com/en/sql-reference/sql/create-tag.html#create-tag .
CREATE OR REPLACE TAG PII            COMMENT = 'Personally Identifiable Information';
CREATE OR REPLACE TAG PII_S_FullName COMMENT = 'Personally Identifiable Information -> Sensitive -> Full Name';
CREATE OR REPLACE TAG PII_S_SSN      COMMENT = 'Personally Identifiable Information -> Sensitive -> Social Security Number';
CREATE OR REPLACE TAG PII_N_Gender   COMMENT = 'Personally Identifiable Information -> Non-Sensitive -> Gender';
CREATE OR REPLACE TAG PII_N_DoB      COMMENT = 'Personally Identifiable Information -> Non-Sensitive -> Date of Birth';
The following displays tags, noting context.
SHOW tags;
Ensure we can retrieve data from pii_test.
SELECT * FROM pii_test;
With tags declared, we can now assign to pii_test.
ALTER TABLE pii_test SET TAG PII = 'Personally Identifiable Information';
ALTER TABLE pii_test MODIFY COLUMN full_name              SET TAG PII_S_FullName = 'Personally Identifiable Information -> Sensitive -> Full Name';
ALTER TABLE pii_test MODIFY COLUMN social_security_number SET TAG PII_S_SSN      = 'Personally Identifiable Information -> Sensitive -> Social Security Number';
ALTER TABLE pii_test MODIFY COLUMN gender                 SET TAG PII_N_Gender   = 'Personally Identifiable Information -> Non-Sensitive -> Gender';
ALTER TABLE pii_test MODIFY COLUMN date_of_birth          SET TAG PII_N_DoB      = 'Personally Identifiable Information -> Non-Sensitive -> Date of Birth';
Fetch the tags associated with pii_test.
SELECT *
FROM   TABLE ( TEST.information_schema.tag_references( 'pii_test', 'TABLE' ));
Fetch the tags associated with the full_name attribute. Note that the TABLE tag is inherited. Tags applied to an object are automatically applied to the attributes.
SELECT *
FROM   TABLE ( TEST.information_schema.tag_references( 'pii_test.full_name', 'COLUMN' ));
We can also fetch the value associated with each tag.
SELECT system$get_tag ( 'PII', 'pii_test', 'TABLE' );
SELECT system$get_tag ( 'PII_S_FullName', 'pii_test.full_name', 'COLUMN' );
For a named table we can also use this query.
SELECT *
FROM   TABLE ( information_schema.tag_references_all_columns ( 'pii_test', 'TABLE' ));

Multiple Tag Value Implementation

We can also constrain a tag to multiple allowed values, noting the limitation of 20 values per tag. Note that at the time of writing this book, this feature is in public preview, as documented at https://docs.snowflake.com/en/sql-reference/sql/alter-tag.html#specifying-allowed-values .
ALTER TAG PII ADD ALLOWED_VALUES 'Personally Identifiable Information', 'PII Admin: pii@your_org.xyz';
The following displays the allowed values for a tag.
SELECT system$get_tag_allowed_values ( 'TEST.public.PII' );
The following unsets them.
ALTER TAG PII UNSET ALLOWED_VALUES;

Tag Identification

So far, our work has focused on the bottom-up build of object tagging capability. Our users are less interested in programmatic constructs; they need a way to quickly identify tags of interest by searching and filtering all available tags in our data warehouse, then selecting and applying their criteria to drill down into the information required. And there are challenges with dynamically driving table functions: the parameters do not accept attributes passed in from a SQL query. Let’s turn our attention to a top-down view of object tagging, and for this, we must extend our Account Usage store knowledge.

Please refer to Chapter 6 for details on accessing the Account Usage store.

To satisfy our users’ need to identify available tags, and noting the latency for tags and tag_references is stated to be two hours at https://docs.snowflake.com/en/sql-reference/account-usage.html#account-usage-views , run the following query.
SELECT *
FROM   snowflake.account_usage.tags
WHERE  deleted IS NULL
ORDER BY tag_id;
The following identifies where objects have been tagged.
SELECT *
FROM   snowflake.account_usage.tag_references
ORDER BY tag_name, domain, object_id;

With these queries, we can create a view for end user access. I leave this as an exercise for you to consolidate your knowledge, though one possible starting point is sketched next.
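A minimal sketch combining the two Account Usage views; the view name and column selection are illustrative, and you will want to add your own filters and entitlements before exposing it to end users.
CREATE OR REPLACE VIEW v_tag_catalog
AS
SELECT tr.tag_name,
       tr.tag_value,
       t.comment        AS tag_description,
       tr.object_database,
       tr.object_schema,
       tr.object_name,
       tr.column_name,
       tr.domain
FROM   snowflake.account_usage.tag_references tr,
       snowflake.account_usage.tags           t
WHERE  tr.tag_id  = t.tag_id
AND    t.deleted IS NULL;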

Tag Cleanup

Here’s how we unset tags.
ALTER TABLE pii_test UNSET TAG PII;
ALTER TABLE pii_test MODIFY COLUMN full_name            UNSET TAG PII_S_FullName;
ALTER TABLE pii_test MODIFY COLUMN social_security_number UNSET TAG PII_S_SSN;
ALTER TABLE pii_test MODIFY COLUMN gender               UNSET TAG PII_N_Gender;
ALTER TABLE pii_test MODIFY COLUMN date_of_birth          UNSET TAG PII_N_DoB;
And drop tags.
DROP TAG PII;
DROP TAG PII_S_FullName;
DROP TAG PII_S_SSN;
DROP TAG PII_N_Gender;
DROP TAG PII_N_DoB;

Data Presentation

Outbound data from our data warehouse is the point at which our users interact to select data sets and where reporting tooling delivers business value. Several options are discussed in Chapter 13, but first, a note of warning: data modeling can be a fractious subject and should be approached with a degree of sensitivity.

I offer some general comments without wishing to take sides or inflame the debate. Utilizing outbound data may include aggregations, summarizations, and filters, which may lend weight to a presentation layer design depending upon the use cases. Also, we cannot ignore data governance because, as you see in Chapter 12, the ability to control access and make visible who can see what is most important.

We do not describe or propose specific outbound data patterns, as space does not permit; instead, we defer to Chapter 13 for a more in-depth discussion.

Data Quality Exceptions

We have identified data quality exceptions and the need to propagate them back to the source. In Snowflake, a secure view provides a simple mechanism to extract data for individual sources on demand. We do not anticipate anything more than a response file generated per feed ingestion on a 1:1 basis, plus a user interface where aggregated or summary metrics with trends over time are provisioned.
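A minimal sketch of such a secure view follows, reusing the data_quality_exception table built earlier and filtering to a single source object per view; the view name and filter are illustrative.
CREATE OR REPLACE SECURE VIEW v_dq_exception_content_test
AS
SELECT data_quality_exception_id,
       validation_routine,
       data_quality_exception_code_id,
       source_attribute_name,
       source_record_pk_info,
       insert_timestamp
FROM   data_quality_exception
WHERE  source_object_name = 'scd2_content_test';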

Tag-Based Query

Adopting tag-based queries is more promising for self-service users. However, a high degree of complexity awaits the unwary. While technically not difficult to provision automatic serving up of data, the ever-changing governance aspects in multi-jurisdiction organizations make this approach impossible even where a comprehensive suite of tags is deployed across multiple dimensions, never mind the tag maintenance aspect.

We envisage a tag-based query as a tool used internally in an organization to identify objects and attributes of interest. With object tagging, primary use cases include privacy impact assessments or data subject access requests. Naturally, you may identify your own use cases, and those suggested are not considered exclusive.

Cleanup

Apart from cleanup scripts to remove our test code, we should also be mindful to periodically remove redundant files uploaded to internal stages, remembering each file contributes toward storage costs; a housekeeping sketch follows the cleanup script.
USE ROLE      sysadmin;
USE DATABASE  TEST;
USE SCHEMA    public;
DROP TASK     task_load_test_data;
DROP DATABASE test;
USE ROLE      accountadmin;
DROP STORAGE INTEGRATION test_integration;
REVOKE EXECUTE TASK ON ACCOUNT FROM ROLE sysadmin;
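For internal stages (this chapter’s example used an external S3 stage), a minimal housekeeping sketch follows; the stage name and pattern are hypothetical, and LIST lets you confirm what REMOVE will delete before running it.
LIST   @my_internal_stage;
REMOVE @my_internal_stage PATTERN = '.*[.]csv[.]gz';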

Summary

This chapter introduced reference data and its importance in validating data quality. I then explained how to extend data pipelines with a validation example showcasing how we can implement parameter-driven stored procedures to make them generic and reusable across all tables and views.

Our discussion moved on to handling the data quality exceptions generated by validation routines and enriching our data by joining reference data.

Dynamic data masking proved to be an interesting and surprisingly complex topic. The chapter provided insight, a test case, and a basis for further investigation and self-learning.

We then investigated row-level security with a hands-on, practical examination of RLS before moving to object tagging, an evolving subject that provokes thinking about system designs enabling users to interact with tags without programming knowledge.

Working through this chapter has personally proven to be very interesting. I learned a lot, including that most topics are evolving, with some features not yet generally available. The best advice is to evaluate each section in conjunction with the documentation when you are ready to build capability, and refactor accordingly.

Having extended data pipelines into end-to-end delivery, let’s look at structured, semi-structured, and unstructured data.
