Using the bq utility for BigQuery

The bq command-line utility is used to interact with the Google BigQuery service on Google Cloud Platform:

  • Use the following command to check the version of the bq utility once the SDK is installed:
bq version
  • Type the following command to confirm which project the bq utility will use. The bq utility shares settings with the gcloud utility. If you wish to change the project, then run the command in the second line and choose the project you want to work on:
gcloud info
gcloud init
If an older version of Google Cloud SDK is installed on the machine, then run the bq init command to choose the default project for the bq utility to use. Use the bq help option to see the complete details of the command, its options, and its switches.
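For example, assuming the SDK is already installed, the following commands initialize the default project for the bq utility and display its help; any bq command name (load is shown here as an illustration) can be passed to help for command-specific details:
bq init
bq help
bq help load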
  • The first step in using BigQuery is to create a dataset in a project and then create tables under the dataset. The following command will create an empty dataset named HumanResourceDS in the project:
bq mk HumanResourceDS
  • To create a table in the dataset, use the make option and specify the table name, the list of columns in the table, and the datatype for the columns in the table. A new table will be created under the specified dataset with the specified columns and datatype:
bq mk -t HumanResourceDS.employee_details Employee_ID:INTEGER,Employee_First_Name:STRING,Employee_Last_Name:STRING,Employee_Joining_Date:DATE,Employee_location:STRING
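To confirm that the table was created with the expected columns, you can print its schema; the --schema flag limits the output of the show command to the column definitions:
bq show --schema HumanResourceDS.employee_details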
  • To load data from a CSV file stored in Google Cloud Storage into the table, use the load command. The --skip_leading_rows option skips the header row in the file, and the --autodetect option lets BigQuery infer the schema from the file:
bq load --skip_leading_rows 1 --autodetect HumanResourceDS.employee_details gs://bucketname/employeedetails.csv
It is recommended to use column headings in the file and to make the header names the same as or similar to the BigQuery table column names. When BigQuery imports data into a table and finds a header row in the file, it tries to match the file's columns against the table columns. This means developers do not have to worry about the column order in the file matching the column order in the table.
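For example, a header row in employeedetails.csv that matches the table columns defined earlier might look like this (the data row is just an illustration):
Employee_ID,Employee_First_Name,Employee_Last_Name,Employee_Joining_Date,Employee_location
101,John,Smith,2015-06-01,USA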

The previous load command assumes that the column delimiter is a comma and the row delimiter is a newline. In some cases the column delimiter varies; the file may use ;, |, Tab, or some other character. Use the --field_delimiter option to specify the delimiter for the columns. If the column delimiter is Tab, then specify tab:

bq load --skip_leading_rows 1 --autodetect --field_delimiter tab HumanResourceDS.employee_details gs://bucketname/employeedetails.csv

The previous command will not insert any rows if even a single row from the file fails to insert into the table. If you need to insert the rows that successfully match the schema and ignore the failed rows, then use the --max_bad_records option as shown in the following code. The following example allows a maximum of two rows from the file to fail. If more than two rows fail to import, then no rows will be imported:

bq load --skip_leading_rows 1 --autodetect --field_delimiter tab --max_bad_records 2 HumanResourceDS.employee_details gs://bucketname/employeedetails.csv

By default, the load option will append the data to the table. If you would like to overwrite the data in the table then use the replace option, as shown in the following command:

bq load --skip_leading_rows 1 --autodetect --replace HumanResourceDS.employee_details gs://bucketname/employeedetails.csv
In addition to standard CSV or TSV files, Google BigQuery can import data from the following file formats: newline-delimited JSON, Google Cloud Datastore backup, and Avro.
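For example, the --source_format option selects the format of the file being loaded. The following commands are sketches that assume suitably formatted JSON and Avro files already exist in the bucket:
bq load --source_format NEWLINE_DELIMITED_JSON HumanResourceDS.employee_details gs://bucketname/employeedetails.json
bq load --source_format AVRO HumanResourceDS.employee_details gs://bucketname/employeedetails.avro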

It is possible to start an import asynchronously, without waiting for the previous import to finish. Use the --nosync option as shown in the following command. When the command is run in nosync mode, the ID of the job that is importing the data is printed:

bq --nosync load --skip_leading_rows 1 --autodetect --replace HumanResourceDS.employee_details gs://bucketname/employeedetails.csv
The nosync option can be used with any command in the bq utility to perform that operation asynchronously. Use this option as the default for all your operations to increase performance. 

The following steps show how to check the status of a BigQuery job, how to cancel a job, and how to list the datasets and tables in the project:

  • To get the status of the job, run the show command as follows. This command shows basic information about the job:
bq show -j bqjob_r42a0d6157d2cfa52_0000015d2ae4f5a7_1
  • To cancel the job that was started in no sync mode, use the cancel command as follows to cancel the operation:
bq cancel bqjob_r42a0d6157d2cfa52_0000015d2ae4f5a7_1
  • The show command can be used to display the details of datasets, tables, and views in the project. The following command shows the details of a dataset and a table inside the dataset:
bq show HumanResourceDS
bq show HumanResourceDS.employee_details
  • To list datasets, jobs, tables, and so on, use the ls option. The first command lists all the datasets in the current project. The second line lists the jobs executed in the project under the Google Cloud account in which the SDK is configured. The third line lists the tables under the specified dataset:
bq ls
bq ls -j
bq ls HumanResourceDS

The output of bq ls -j shows the history and status of each job.

  • To remove the dataset from the project or delete tables from the dataset in a project, use the rm command shown as follows. The first line will remove the table after the user confirms its deletion. If you want to remove a table without prompting for user confirmation, use the -f option as shown in the second line. The third line removes all tables from the specified dataset without prompting the user. The dataset will also be deleted. Use this with caution:
bq rm DatasetName.TableName
bq rm -f DatasetName.TableName
bq rm -r -f DatasetName
  • If you want to introduce a wait time for a job started with the --nosync option, then use the wait command to make the script wait for a specified number of seconds or until the job finishes. The first line in the code makes the program wait until the specified job finishes. The second line makes the program wait for 60 seconds and return after 60 seconds if the job is still running, or return immediately if the job ends before 60 seconds have passed:
bq wait job_id
bq wait job_id 60
  • To see the top n rows in a table, use the head command shown as follows. The first line of code displays five rows from the table. The second line of code skips the first five rows of the table and shows the next three rows in the table:
bq head -n 5 HumanResourceDS.employee_details
bq head -s 5 -n 3 HumanResourceDS.employee_details
  • To insert records into a table, use the insert option as shown in the following code. The insert option accepts records in newline-delimited JSON format. Save one or more records in JSON format in a file, as follows:
{"Employee_ID":"99","Employee_Joining_Date":"2012-09-24","Employee_Last_Name":"Raja","Employee_First_Name":"Dastigar","Employee_location":"UK"}
{"Employee_ID":"200","Employee_Joining_Date":"2017-02-02","Employee_Last_Name":"Sivaram","Employee_First_Name":"Haridass","Employee_location":"USA"}
  • Execute the insert command shown as follows. This will parse the JSON and load the records into the specified table:
bq insert HumanResourceDS.employee_details record.json
  • To export data from the table to a CSV format or JSON format, use the extract command as shown here. The first line exports all the records in the specified table to CSV format. The second line exports all the records from the specified table in JSON format:
bq extract HumanResourceDS.employee_details gs://bucketname/records.csv
bq extract --destination_format NEWLINE_DELIMITED_JSON HumanResourceDS.employee_details gs://bucketname/records.json
Use the --compression option with GZIP or NONE as the value to compress the exported data to save time during download. If you want to have a different delimiter for columns, use the --field_delimiter option to specify that character.
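For example, the following sketch exports the table as a GZIP-compressed, pipe-delimited file; the output object name is just an illustration:
bq extract --compression GZIP --field_delimiter "|" HumanResourceDS.employee_details gs://bucketname/records.csv.gz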
  • To copy all the records from one table to another, and either append to or overwrite the destination table, use the cp command as follows. The destination table will be created if it does not exist, but the target dataset must exist. By default, the data in the destination table will be overwritten after user confirmation. Use the --append_table option if you want to append records to the destination table. If you want to suppress the prompt, use the -f option:
bq cp HumanResourceDS.employee_details CopyDS.new_employee_details
bq cp --append_table HumanResourceDS.employee_details CopyDS.new_employee_details
  • The next command is the most important and most frequently used command in the bq utility: the query command. To explore the query command, we will be using the public dataset provided by Google BigQuery. The dataset is available under the Public Datasets project, named nyc-tlc:yellow. It has one table, trips, with over 1 billion rows in it.
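You can inspect this public table from the command line with the show and head commands covered earlier; for example, the following commands display the table's metadata and its first five rows:
bq show nyc-tlc:yellow.trips
bq head -n 5 nyc-tlc:yellow.trips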
  • This query returns the number of trips made by cabs in each year. It uses only one column from the table and should not need to scan more than 10 GB of data:
SELECT year(pickup_datetime) as trip_year, count(1) as trip_count
FROM [nyc-tlc:yellow.trips]
group by trip_year
order by trip_year
  • Run the query using the bq utility's query command, as shown in the following code. The result set is small and hence can be viewed on the screen:
bq query "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year"
  • This command displays the trip count per year in the console.
  • To load the query result into a new or existing table, use the --destination_table option, as shown in the following code. The destination table will be created if it does not exist.
  • If the destination dataset is not already created, then run the following command to create it:
bq mk ResultDS
  • If the table already exists then the bq utility will throw the error Already Exists: Table project-id:datasetname.tablename:
bq query --destination_table ResultDS.TripsPerYear "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year"
  • To append data to an existing table, use the --append_table option as shown here. To overwrite the table with new data, use the --replace option instead of --append_table in the following code:
bq query --destination_table ResultDS.TripsPerYear --append_table "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year" 
  • The following query will throw an error when run because the result set it returns is larger than the 128 MB compressed maximum response size:
SELECT pickup_datetime
,dropoff_datetime
,pickup_longitude
,pickup_latitude
,dropoff_longitude
,dropoff_latitude
FROM [nyc-tlc:yellow.trips]
WHERE year(pickup_datetime) = 2015
  • When the preceding query is run with the bq query command and a destination table is specified, it will throw the following error. If a query returns results with a compressed size greater than 128 MB, then the --allow_large_results flag must be set to true:
Error: Response too large to return. Consider setting destinationTable or (for legacy SQL queries) setting allowLargeResults to true in your job configuration. For more details, see https://cloud.google.com/bigquery/querying-data#largequeryresults.

The preceding query should be run with the --allow_large_results flag, as shown in the following code. This query returns the specified columns for all the trips made in 2015:

bq query --destination_table ResultDS.TripsPerYear --allow_large_results "SELECT pickup_datetime,dropoff_datetime,pickup_longitude,pickup_latitude,dropoff_longitude,dropoff_latitude FROM [nyc-tlc:yellow.trips] WHERE year(pickup_datetime) = 2015"

One way to reduce billing in BigQuery is to specify the --use_cache option when running queries using the bq utility. This option returns the result from the cache if the tables involved in the query have not changed. When a query returns a result from the cache, you are not billed for it. Another benefit of using this flag is that results are returned faster when they are already cached:

bq query --use_cache "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year"

Another option that helps check whether a query's result is in the cache is the --require_cache option. When this flag is added to the command, the query will be executed only if it has already been run and its results are stored in the cache and are still valid. If the query is not in the cache, then an error similar to the following is thrown:

BigQuery error in query operation: Error processing job 'my-first-project-170319:bqjob_r24b2e799987f7072_0000015d35289e4a_1':
Not found: Table my-first-project-170319:_b513538e07c9532385313c26aa07553fc064a62e.anon34f43b24853de689a717ff5cfc9fb86c75a44d97

The following command uses --require_cache to execute the query and get its result only from the cache; the query will not be executed if its result is not in the cache:

bq query --require_cache "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year"

This option is helpful for running expensive queries that are expected to be cached after their first run. Another option that helps estimate the cost of queries is the --dry_run flag. The --dry_run flag does not execute the query; instead, it returns an estimate of the number of bytes of data that would be processed. This helps the user judge how expensive the query may be:

bq query --dry_run "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year"

The dry run of a query will display a message similar to the following, showing the expected number of bytes to be processed. This is a very useful option for QA team members to estimate the cost of running queries:

Query successfully validated. Assuming the tables are not modified, running this query will process 8870235704 bytes of data.

There are a few additional options available in the bq utility's query command. Here are quick notes on them, followed by a combined example:

  • --batch: This flag will make the query run in batch mode. To populate huge data into a destination table, use this option so that the query can be run as a background job. Batch mode jobs have fewer quota restrictions but take a few hours to complete. This option would be helpful during the initial loading of data from your data warehouse to Google BigQuery.
  • --max_rows: This flag will make the query return only the specified number of rows. The default number of rows returned is 100.
  • --use_legacy_sql: This flag will make the query run as legacy SQL instead of standard SQL. More on this topic will be covered in Chapter 4, BigQuery SQL Basic.
  • --start_row: This flag specifies the start row to return from the result set of the query.
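The following sketch combines several of these flags; it forces legacy SQL and returns rows 6 through 15 of the result (the row offsets are just an illustration):
bq query --use_legacy_sql=true --start_row 5 --max_rows 10 "SELECT year(pickup_datetime) as trip_year, count(1) as trip_count FROM [nyc-tlc:yellow.trips] group by trip_year order by trip_year"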
Use the bq and gsutil command-line utilities for your initial loading process and batch jobs. For interactive applications, use the BigQuery and Google Cloud Storage APIs, which are more extensible to your requirements.

The following are some of the standard operations implemented in the projects I have worked on using Google BigQuery. These are jobs that run on a scheduled basis; a simplified sketch of such a job follows the list:

  • Using gsutil, upload all the files generated by the applications running on-premises or in other cloud service provider networks to Google Cloud Storage.
  • Use the bq utility to load the files into the BigQuery tables with either the append or overwrite option. The append and overwrite options will create a destination table if it does not exist.
  • Use the bq utility to run expensive queries, cache the results set, and force the applications to use the cache flag when executing the queries. In some queries, force the app to use a required cache flag to save money on billing for very expensive queries.
  • Use the bq utility to run aggregation jobs and load the data into daily aggregation, weekly aggregation, or monthly aggregation tables.
  • Use gsutil to change the storage class of buckets after n number of days to Nearline or Coldline storage to save some money.
  • Use gsutil to remove unwanted files in the storage, to turn off versioning for buckets if not needed, and to compose multiple files into one file for easy archiving and restoring.
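The following is a minimal sketch of such a scheduled job; the bucket name, file paths, dataset names, and the aggregation query are assumptions used only for illustration:
#!/bin/bash
# Upload the files generated on-premises to Google Cloud Storage.
gsutil cp /var/exports/employeedetails_*.csv gs://bucketname/incoming/

# Append the uploaded files into the BigQuery table, skipping the header rows.
bq load --skip_leading_rows 1 --autodetect HumanResourceDS.employee_details "gs://bucketname/incoming/employeedetails_*.csv"

# Rebuild the daily aggregation table from the detail table.
bq query --destination_table ReportDS.daily_headcount --replace "SELECT Employee_location, count(1) as employee_count FROM [HumanResourceDS.employee_details] GROUP BY Employee_location"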