Sanitizing data

Most data warehousing projects follow a standard process. This process involves the extraction of data from a data source, the transformation of this data to both the standards of the data warehouse and the requirements of the end user, and the loading of data into the resulting database table. This process is more commonly known as the Extract, Transform, Load Process, or ETL for short.

The transformation step is important for a few reasons:

  • Decoding of encoded values (that is, converting values of W and E to West and East; see the sketch after this list)
  • Calculation of values (that is, calculating the average order value by dividing the revenue by the count of orders)
  • Splitting separated lists into individual columns
  • Aggregation
  • Data validation, either in the form of invalidating incorrect values or as reprocessing of incorrect data
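
As a simple illustration of the decoding step mentioned above, the following awk sketch converts a hypothetical third column containing the codes W and E into West and East. The file names and the column position are only assumptions for the example, not part of any particular project.

awk -F, 'BEGIN{OFS=","} {if ($3=="W") $3="West"; else if ($3=="E") $3="East"; print}' orders.csv > decoded.csv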

If you are using files to load data into BigQuery, then the following are some of the cleanup tasks that have to be performed on the files.

  • Remove the ASCII character 0 - NUL

    If you are exporting files using BCP in SQL Server, or from Unix or Linux systems, there is a high chance the generated files contain this character. Try to remove these NUL characters when extracting the data from the source system, such as a database. BigQuery treats NUL as an end-of-line character and therefore throws an error when loading files that contain it, reporting that the row containing this character does not have a value for all columns. You can get rid of NUL in one of the following ways:

    • If the RDBMS you are using stores this character, use a LEN(column_name) = 0 condition to replace the column value with NULL or an empty string.
    • If you want to manually replace the NUL character in small files, open the file in an editor such as Notepad++ that shows the hex code of invisible characters, and do a find and replace of the character with an empty string.
    • For large files you can replace the character using string functions.
    • In Linux you can use the tr command, as shown in this example: tr -d '\000' < inputfile.csv > sanitizedfile.csv
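
    To confirm that the NUL characters are gone after sanitizing, you can count the NUL bytes remaining in the output file. The following check reuses tr and should print 0; the file name matches the example above and is only illustrative.

    tr -cd '\000' < sanitizedfile.csv | wc -c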
  • Delimiter in the column value

Sometimes the delimiter used for separating columns is contained in one of the column values. This causes an error that more than n columns were provided for import. In the source system, replace the delimiter character inside the column value with some other character, or use a delimiter character that will not appear in the column values. One other common scenario is using , as the delimiter when column values such as addresses contain a , themselves. In such cases, enclose the column values within double quotes; note that this will increase your file size.
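
For example, a row whose address value contains the comma delimiter can be written with that value enclosed in double quotes, so the comma inside it is not treated as a column separator. The values below are made up purely for illustration:

101,"12 Main Street, Apt 4",Austin,TX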

  • Improper closing of quotes

If your column values happen to have quotes in them, then BigQuery throws an error. BigQuery treats a quote as the start of a column value enclosed in quotes. If the number of quotes is even, say a particular column value or a substring within a column value is enclosed in " or ', that is fine. If there is only one " or ' in the column value, then it throws an error. The solution is to enclose that column value within the other kind of quote: for example, if your column value has a single ', enclose the value within "; if your column value has a single ", enclose the value within '.
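
As an illustration, a value containing a single ' can be enclosed in " so that the parser does not see an unclosed quote. The values below are made up for the example:

102,"O'Brien",Dallas,TX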

  • Unicode replacement character

If your export system can export data only in ASCII encoding, then you may run into this problem. Unicode characters will be replaced in the file with the Unicode replacement character �. This causes an error when BigQuery parses such files. Replace this character with a space, or fix your extraction process so that it extracts Unicode characters properly.
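
On Linux, assuming the file is UTF-8 encoded and GNU sed is available, the following sketch replaces the three-byte UTF-8 sequence of the replacement character (EF BF BD) with a space and writes the result to a new file. The file names here are only examples.

sed 's/\xef\xbf\xbd/ /g' inputfile.csv > cleanedfile.csv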

If the file processing requires an extensive set of tasks, then it is better to use the awk and sed commands in your projects. These commands are available by default on Linux and Mac, and they can be downloaded for Windows. The following are some common scenarios where either awk or sed can be used to manipulate the data in files.

The following command replaces the text NULL with an empty string globally in a given file. Files extracted from traditional RDBMS systems tend to store the null value of a column as the text NULL.

sed -i 's/NULL//g' filename.ext

The following command can be used to add text or a new column at the beginning of each row of a CSV file. This command adds the value Texas, to the start of all rows in the file.

sed -i 's/^/Texas, /' testfile.csv

The following command can be used to add text or a new column at the end of each row of a CSV file. This command adds the value Texas as a new column at the end of all rows in the file.

sed -i 's/$/,Texas/' testfile.csv

The above commands replace the source file with the new contents, so use them with caution and, if needed, back up the source file to another location first.
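
For example, a copy of the file can be taken before running the in-place sed commands; the backup file name below is only an example.

cp testfile.csv testfile.csv.bak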

The following command adds a serial number as the first column of every row in the file. This helps in troubleshooting the loading process if it fails on a particular line. In the example below, the output of the command is stored in the file testfile1.csv.

awk -F, '{$1=++idx FS $1;}1' OFS=, testfile.csv > testfile1.csv

The following command removes the first row of the file and prints the remaining rows to standard output. This can be used when you want to remove the header row from your files.

awk '{if (NR!=1) {print}}' testfile.csv

The following command replaces the third column with an empty value and stores the output in the file testfile2.csv. This is helpful when you want to anonymize the data in the file to be uploaded, for example by removing the first name, last name, or email ID.

awk -F, '{$3="";}1' OFS=, testfile.csv > testfile2.csv

The following command generates one file per unique value of the fifth column in a file. Rows having the same value in the fifth column are written to the same file, and one file is generated for each unique value of the fifth column.

awk -F, '{print > ($5".csv")}' testfile.csv

The following command can be used to merge multiple files into a single file, which is useful when doing a bulk data upload. The command below selects files whose names start with mydata_ and have the extension .csv.

cat mydata_*.csv > merged_data.csv