Querying nested and repeated records

Google BigQuery supports loading of JSON files into BigQuery tables. JSON format data can contain nested datatypes and repeated datatypes. The example table shown in the following screenshot has an Employee_Names column as RECORD datatype. Each record in that column has two columns, one to store the first name and one to store the last name. Create the table as shown in this screenshot:

Download the following file to load to this new table. The file is a JSON file that contains the records to be loaded into this table from JSON format: https://github.com/hthirukkumaran/Learning-Google-BigQuery/blob/master/chapter1/employeedetails.json.

Upload the file to your Google Cloud Storage bucket using the gsutil command as shown here:

gsutil cp employeedetails.json gs://myfirstprojectbucket201706/employeedetails.json

Run the following command to load the JSON file into the new table using the bq utility. The --source_format flag is used to specify the format of the file as JSON. Replace the filename and bucket name as per your project:

bq load --source_format=NEWLINE_DELIMITED_JSON HumanResourceDS.Employee_Nested_Demo gs://myfirstprojectbucket201706/employeedetails.json

The table should now have the records and look like this screenshot. The columns FirstName and LastName are shown using the parent column's name as the qualifier.

The following query will return all the columns from the table, including nested columns, as individual columns in the output. This query explicitly flattens out the fields in the record column type:

#standardSQL
SELECT EmployeeID, Employee_Names.FirstName, Employee_Names.LastName, Employee_Location
FROM `my-first-project-170319.HumanResourceDS.Employee_Nested_Demo`

BigQuery automatically lists all the child columns in the nested type if the parent column is specified. The first name and last name are shown as individual columns when this query is run:

#standardSQL
SELECT EmployeeID, Employee_Names, Employee_Location
FROM `my-first-project-170319.HumanResourceDS.Employee_Nested_Demo`

The following paragraph explains how to load a JSON file that contains repeated fields in it for a column. Create a new table as shown in the following screenshot. The Employee_Location column is marked as a REPEATED type and it shows various countries in which the employees have worked.

Download the JSON file from the following URL to load it to the new table created before: https://github.com/hthirukkumaran/Learning-Google-BigQuery/blob/master/chapter1/employeedetailsrepeated.json

Upload this file to the Google Cloud Storage bucket using gsutil, as follows:

gsutil cp employeedetailsrepeated.json gs://myfirstprojectbucket201706

Load the file to the destination table using the bq utility, as follows:

bq load --source_format=NEWLINE_DELIMITED_JSON HumanResourceDS.Employee_Repeated_Demo gs://myfirstprojectbucket201706/employeedetailsrepeated.json

Once the data is loaded, the table should look like the one shown here, showing all the countries for the employees:

The following query returns the flattened structure of the records in the table. The total number of records returned in this query will be 10:

#standardSQL
SELECT EmployeeID, Employee_Names.FirstName, Employee_Names.LastName, EmpCountry.Country
FROM `my-first-project-170319.HumanResourceDS.Employee_Repeated_Demo` ,
UNNEST(Employee_Location) as EmpCountry

The following query returns only the employees who have worked in USA and UK by flattening the Employee_Location column:

#standardSQL
ELECT EmployeeID, Employee_Names.FirstName, Employee_Names.LastName, EmpCountry.Country
FROM `my-first-project-170319.HumanResourceDS.Employee_Repeated_Demo` ,
UNNEST(Employee_Location) as EmpCountry
WHERE EmpCountry.Country in ('USA','UK')

This query returns the list of employees who have worked in at least two countries:

#standardSQL
SELECT EmployeeID, Employee_Names.FirstName, Employee_Names.LastName
FROM `my-first-project-170319.HumanResourceDS.Employee_Repeated_Demo`
WHERE ARRAY_LENGTH(Employee_Location) > 1

The following is a legacy SQL code block that will return the number of countries for each employee from the table. This uses the WITHIN clause:

SELECT EmployeeID, Employee_Names.FirstName, Employee_Names.LastName, COUNT(Employee_Location.Country) WITHIN RECORD as countrycount
FROM [HumanResourceDS.Employee_Repeated_Demo]
..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.21.46.92