Overview of bulk data loading: We will cover what bulk data loading is, file load locations, supported file formats and encoding, compression handling, and encryption options.
Bulk data loading recommendations: We will cover file preparation including file sizing and splitting, the CSV and semistructured formats, staging, loading, and querying.
Note
Continuous data loading with Snowpipe is covered in Chapter 6.
Overview of Bulk Data Loading
What is bulk data loading?
Where can we bulk data load from?
What are the compression and encryption options?
What file formats are supported?
What Is Bulk Data Loading?
To get data into a database table, you need to insert it. Insert statements can take a while since they need to be executed one row at a time. Bulk copying can take a large amount of data and insert it into a database all in one batch. The bulk data loading option in Snowflake allows batch loading of data from files that are in cloud storage, like AWS S3.
If your data files are not currently in cloud storage, then there is an option to copy the data files from a local machine to a cloud storage staging area before loading them into Snowflake. This is known as Snowflake’s internal staging area. The data files are transmitted from a local machine to an internal, Snowflake-designated, cloud storage staging location and then loaded into tables using the COPY command.
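As a minimal sketch of this workflow (the table name t1, the file name mydata.csv, and the local path are hypothetical), the file is first staged with PUT from a client such as SnowSQL and then loaded with COPY:

put file:///tmp/mydata.csv @%t1;
copy into t1 from @%t1 file_format = (type = 'csv' skip_header = 1);

Here @%t1 refers to the table stage that Snowflake automatically provides for table t1; a named internal stage created with CREATE STAGE would work the same way. Note that PUT must be run from a client such as SnowSQL rather than the classic web interface.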
Tip
External tables can be created instead of loading data into Snowflake. This would be useful when only a portion of data is needed.
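As a hedged sketch (the stage name my_s3_stage and the column definitions are hypothetical), an external table lays a schema over staged files without loading them:

create external table ext_sales (
    sale_date date as (value:c1::date),
    amount number as (value:c2::number)
)
location = @my_s3_stage/sales/
file_format = (type = 'csv');

Queries against ext_sales then read the files in place, which avoids loading data that may be needed only occasionally.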
Bulk Load Locations
Files can be bulk loaded from the following locations:
Snowflake-designated internal storage staging location
AWS S3, where files can be loaded directly from any user-supplied S3 bucket
GCP Cloud Storage, where files can be loaded directly from any user-supplied GCP Cloud Storage container
Azure Blob storage, where files can be loaded directly from any user-supplied Azure container
Note
Data transfer billing charges may apply when loading data from files staged across different platforms. Refer to Chapter 2 for more information.
Supported File Formats and Encoding
Delimited files (any valid delimiter is supported; the default is a comma)
JSON and XML
Avro, including the automatic detection and processing of staged Avro files that were compressed using Snappy
ORC, including the automatic detection and processing of staged ORC files that were compressed using Snappy or zlib
Parquet, including the automatic detection and processing of staged Parquet files that were compressed using Snappy
For delimited files, the default character set is UTF-8. To use any other character set, you must explicitly specify the encoding to use for loading. For all other supported file formats (JSON, Avro, etc.), the only supported character set is UTF-8.
Note
Many character encoding sets are supported for the loading of delimited files. Refer to Snowflake’s online documentation for more details on which character encodings are supported.
Compression Handling
When staging uncompressed files in a Snowflake stage, the files are automatically compressed using gzip, unless compression is explicitly disabled. Snowflake can automatically detect the gzip, bzip2, deflate, and raw_deflate compression methods. Autodetection is not yet supported for brotli and zstandard; therefore, when staging or loading files compressed with either of these methods, you must explicitly specify the compression method that was used.
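For example (the stage name here is illustrative), a COPY command loading brotli-compressed CSV files must name the compression method in the file format options:

copy into t1 from @my_stage/data/
file_format = (type = 'csv' compression = 'brotli');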
Encryption Options
When staging unencrypted files in an internal Snowflake location, the files are automatically encrypted using 128-bit keys. 256-bit keys can be enabled (for stronger encryption); however, additional configuration is required. Files that are already encrypted can be loaded into Snowflake from external cloud storage; the key used to encrypt the files must be provided to Snowflake.
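As a hedged sketch of loading already-encrypted files from external storage (the bucket, credentials, and key values are placeholders), the master key used for client-side encryption is passed in the COPY command:

copy into t1 from s3://mybucket/data/
credentials = (aws_key_id = '...' aws_secret_key = '...')
encryption = (master_key = '...')
file_format = (type = 'csv');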
Bulk Data Loading Recommendations
Loading large data sets can affect query performance. Snowflake recommends dedicating separate warehouses to loading and querying operations to optimize the performance for each. In this section, we will cover the recommended ways to prepare the files.
File Preparation
The number of data files that can be processed in parallel is determined by the number and capacity of servers in a warehouse. If you follow the file sizing guidelines described in the following section, the data loading will require minimal resources. Note that these recommendations apply to both bulk data loads and continuous loading using Snowpipe.
File Sizing
The number of load operations that can run in parallel cannot exceed the number of data files to be loaded.
To optimize the number of parallel operations for a load, we recommend aiming to produce data files roughly 10 MB to 100 MB in size, compressed.
Aggregate smaller files to minimize the processing overhead for each file.
Split larger files into a greater number of smaller files to distribute the load among the servers in an active warehouse. The number of data files processed in parallel is determined by the number and capacity of servers in a warehouse.
Snowflake recommends splitting large files by line to avoid records that span chunks.
Data loads of large Parquet files (e.g., greater than 3 GB) could time out. Split large files into files 1 GB in size (or smaller) for loading.
File Splitting
If your source database does not allow you to export data files in smaller chunks, use a third-party utility to split large CSV files. Windows does not include a native file split utility; however, Windows supports many third-party tools and scripts that can split large data files. Linux has the split utility, which enables you to split a CSV file into multiple smaller files.
Note
Splitting larger data files allows the load to scale linearly. Using a larger warehouse (X-Large, 2X-Large, etc.) will consume more credits and may not result in any performance increase.
CSV File Preparation
UTF-8 is the default character set; however, additional encodings are supported. Use the ENCODING file format option to specify the character set for the data files.
Snowflake supports ASCII characters (including high-order characters) as delimiters. Common field delimiters include the pipe (|), comma (,), caret (^), and tilde (~).
A field can optionally be enclosed in double quotes; within an enclosed field, all special characters are escaped automatically, except the double quote itself, which must be escaped by doubling it (""). For unenclosed fields, a backslash (\) is the default escape character.
Common escape sequences can be used (e.g., \t for tab, \n for newline, \r for carriage return, and \\ for backslash).
Fields containing carriage returns should also be enclosed in quotes (single or double).
The number of columns in each row should be consistent.
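These CSV guidelines can be collected into a named file format; the specific option values shown here (pipe delimiter, ISO-8859-1 encoding, one header row) are illustrative assumptions:

create file format my_csv_format
    type = 'csv'
    field_delimiter = '|'
    field_optionally_enclosed_by = '"'
    encoding = 'ISO-8859-1'
    skip_header = 1;

The named format can then be referenced in COPY commands with file_format = (format_name = 'my_csv_format').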
Semistructured Data File Preparation
Semistructured data is data that does not conform to the standards of traditional structured data, but it contains tags or other types of markup that identify individual, distinct entities within the data.
Structured data requires a fixed schema that is defined before the data can be loaded and queried in a relational database system. Semistructured data does not require a prior definition of a schema and can constantly evolve; i.e., new attributes can be added at any time.
In addition, entities within the same class may have different attributes even though they are grouped together, and the order of the attributes is not important.
Unlike structured data, which represents data as a flat table, semistructured data can contain any number of levels of nested hierarchies.
The steps for loading semistructured data into tables are identical to those for loading structured data into relational tables. Snowflake loads semistructured data into a single VARIANT column. You can also use a COPY INTO <table> statement with a transformation to extract selected columns from a staged data file into separate table columns.
When semistructured data is inserted into a VARIANT column, what Snowflake is really doing is extracting information about the key locations and values and saving it into a semistructured document. The document is referenced by the metadata engine for fast SQL retrieval.
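A minimal sketch of this pattern (the table and stage names are hypothetical) creates a single-VARIANT-column table and loads staged JSON into it:

create table raw_json (v variant);
copy into raw_json from @my_stage/events/
file_format = (type = 'json');
select v:customer.name from raw_json;

The final SELECT shows the colon path notation used to pull individual keys back out of the VARIANT column.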
Note
VARIANT “null” values (not to be confused with SQL NULL values) are not loaded into the table. To avoid this, extract semistructured data elements containing “null” values into relational columns before loading them. Alternatively, if the “null” values in your files indicate missing values and have no other special meaning, Snowflake recommends setting the file format option STRIP_NULL_VALUES to TRUE when loading the semistructured data files.
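For example (the stage and table names are illustrative), the option can be set directly in the COPY command's file format:

copy into raw_json from @my_stage/events/
file_format = (type = 'json' strip_null_values = true);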
File Staging
Both internal and external stage locations in Snowflake can include a path (referred to as a prefix in AWS). When staging regular data sets, Snowflake recommends partitioning the data into logical paths that identify details such as geographical location, along with the date when the data is written.
Organizing your data files by path lets you copy any subset of the data into Snowflake with a single command and execute concurrent COPY statements that match subsets of files, taking advantage of parallel operations. Here are two example paths:
NA/Mexico/Quintana_Roo/Cancun/2020/01/01/01/
EU/France/Paris/2020/01/01/05/
If the file names match except for a suffix or extension, include the matching part of the file names in the path. To name or match specific files, add the FILES or PATTERN option. Here are two examples:
copy into t1 from @%t1/united_states/california/los_angeles/2016/06/01/11/ files=('mydata1.csv', 'mydata2.csv');
copy into t1 from @%t1/united_states/california/los_angeles/2016/06/01/11/ pattern='.*mydata[^[0-9]{1,3}$$].csv';
When planning regular data loads, such as with extract-transform-load (ETL) processing, it is important to partition the data in your internal (i.e., Snowflake) stage or external locations (S3 buckets or Azure containers) using logical, granular paths. Create a partitioning structure that includes identifying details such as the application or location, along with the date when the data was written. You can then copy any fraction of the partitioned data into Snowflake with a single command. You can copy data into Snowflake by the hour, day, month, or even year when you initially populate tables.
s3://bucket_name/brand/2016/07/01/11/
s3://bucket_name/region/country/2016/07/01/14/
Note
S3 transmits a directory list with each COPY statement used by Snowflake, so reducing the number of files in each directory improves the performance of your COPY statements.
Loading
The COPY INTO <table> command lets you specify which staged files to load in three ways:
By path of internal location or prefix of external location
By listing specific files to load (up to 1,000 per COPY command)
By using pattern matching to identify specific files by pattern
These options enable you to copy a fraction of the staged data into a Snowflake table with a single command and to execute concurrent COPY statements that match a subset of files, taking advantage of parallel operations. Note that the files being copied must have the same structure (i.e., number of columns and data types) as the target table.
Tip
Listing specific files to load from a stage is generally the fastest.
In general, pattern matching using a regular expression is the slowest of the three options for identifying/specifying data files to load from a stage; however, this option works well if you exported your files in named order from your external application and want to batch load the files in the same order. Pattern matching can be combined with paths for further control over data loading.
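For instance (the path and pattern here are illustrative), combining a path with PATTERN applies the regular expression only to files under that prefix:

copy into t1 from @my_stage/2020/01/
pattern = '.*sales_[0-9]+[.]csv';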
Files that were loaded successfully can be deleted from the stage during a load by specifying the PURGE copy option in the COPY INTO <table> command.
After the load completes, use the REMOVE command to remove the files in the stage.
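The two cleanup approaches look like this (the stage and path are illustrative):

copy into t1 from @my_stage/2020/01/ purge = true;
remove @my_stage/2020/01/;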
Querying Staged Files
Snowflake lets you query the contents of staged files in two ways:
A standard SELECT statement.
During a COPY into a table. Transformations may be applied to the columns in the same SELECT statement.
Note
Querying staged files is intended for simple checks during data loading; it is not a replacement for querying tables after the data has been loaded.
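As a sketch of both approaches (the stage, format, and table names are hypothetical), staged file columns are referenced positionally as $1, $2, and so on:

select t.$1, t.$2 from @my_stage (file_format => 'my_csv_format') t;
copy into t1 from (select t.$1, upper(t.$2) from @my_stage t);

The second statement applies a transformation (upper) to a column while copying.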
Bulk Loading with the Snowflake Web Interface
For smaller files (less than 50 MB), loading from the Snowflake web interface is fine.
Note
If a file is large, then it should be loaded using SnowSQL or Snowpipe. See the following chapters for more information on how to bulk load with SnowSQL and Snowpipe.
Bulk File Loading Through the Snowflake User Interface
1. You need a Snowflake account; please review Chapter 2 to set one up.
2. Download the file named zips2000.csv.
Instructions
1. Log into the Snowflake web user interface.
2. Click Databases + Tables.
3. Click Create Table and enter the values shown in Figure 4-1.
4. Click Load Data, and the Load Data wizard (Figure 4-2) will appear. Select the warehouse you want to use and click Next.
5. Click Select Files to browse for zips2000.csv in the location where you saved it (Figure 4-3). Click Next.
6. Click the Add button (it looks like a plus sign) to add a file format. Enter what is shown in Figure 4-4. Click Finish.
7. Click Load, and your table will load. Once it has completed loading, you can query it as usual.
Summary
In this chapter, we reviewed bulk data loading with COPY, and we covered what bulk data loading is, file load locations, supported file formats and encoding, compression handling, and encryption options. We also covered bulk data loading recommendations including file preparation and staging, loading, and querying. In addition, we went through some sample exercises on bulk loading data using COPY in our virtual warehouse.