How to work in Google BigQuery using SQL. A simple guide to using Google BigQuery.

Hi there! This might be your first guide to SQL and BigQuery. If so, that’s cool, but if it isn’t, we hope that our guide will provide clear explanations and help you understand things better. The goal of this guide is to help you become more familiar with SQL and BigQuery as quickly as possible so that you can make simple queries by the end of it. With practice, you will be able to carry out difficult and complex data analysis tasks.

We strongly recommend that you write and run the queries in this guide because learning by doing is key to understanding something. Good luck!

P.S. Feel free to send us constructive feedback. We are keen to make this guide as good as possible.

What is Google Bigquery?

Google Bigquery is a cloud-based data warehouse, where you can group collections of data from other databases like SalesForce and data from Google Ads etc. by means of Bigquery datasets and tables.

Terminology

• A ‘project’ is just a collection of datasets and is made using the Google Cloud Platform. If you need to stream data from your service to BigQuery, you can do this by using the project’s credentials and API.

• The ‘dataset’ is a collection of tables, like a folder with files.

• ‘Table’: This presents data in columns and rows. You can query, manipulate and extract particular data from this.

• ‘View’: This is a virtual table. You can think of it almost as a saved query that gives you the real table as a result of running it.

Plenty of data are streamed to BigQuery from the different services that are integrated with it. Let’s assume that your business is E-commerce. You have a website and CRM. These are at least two services that could be connected to BQ. BQ itself can stream its data to different platforms such as Google Data Studio or Tableau and data-visualization tools.

Summary:

1. BigQuery consists of projects (data from services).

2. Projects consist of datasets.

3. Datasets consist of tables and views (queries).

4. Tables consist of rows and columns (two-dimensional array).

Queries

What is going on in the table?

‘Schema’ gives you an understanding of columns and each data type. The most common types are as follows:

• STRING – represents text;
• INTEGER – whole numbers;
• BOOLEAN – the type, which has one of two possible values (1 and 0), where 1 = TRUE, and 0 = FALSE;
• DATA – stores only the date, without the time part;
• TIMESTAMP – is used to store data that contain both date and time parts;
• FLOAT – stores numbers with a fractional value, such as 3.14. Remember: this value is not an exact value; 3.14, for example, is a well-known pattern (Pi), but Pi itself has an infinite number of digits after the delimiter ‘,’. Use this type to show that a number is not a whole;
• NUMERIC – also stores numbers as an INTEGER or FLOAT but where precision needs to be ensured.

The ‘Details’ section has information about the table itself, as follows:

• Table ID – path (address) of the table
• Table size
• Number of rows
• Creation time
• Last modified
• Expiration – the date when the table is deleted (generally set as ‘Never’ = ‘Do not delete’)
• Data location – the region where you want to store your data (EU or US). Remember the stipulations of the GDPR when setting this up.

SELECT

Basics

‘Syntax’ is simple and refers to how you want the results to be shown – from where, with which conditions, how to show/represent your data and how much. The machine itself will execute syntax in this order:

SELECT:
1. FROM
2. WHERE
3. GROUP BY
4. HAVING
5. SELECT
6. ORDER BY

As you can see, this order is not equal to the order of the query that you write in the interface. It is done like this to make the query more readable for users and to help them maintain it. Just remember that the machine executes it a little bit differently.

The syntax for all SQL dialects has pretty much the same structure.

Querying two tables

If two tables have identical schema and data types in columns, you can query these two tables at the same time. The result will be data from each table, given sequentially. The data from table 1 will be presented first, and then the data from table 2 will be presented.

FROM

If you want to take something, you should have an address or path where it is stored, and the path is structured as follows: project.dataset.table.

Sometimes, a table can have nested tables, or it could be partitioned by date, by timestamp or by etc. These are the main results of data streaming in BigQuery.

1. Table with nested tables: For example, project ‘bigquery-public-data’ with dataset ‘samples’ has the table ‘Shakespeare’, which can store such tables as ‘publication15980101’ and ‘sample15980201’ etc., where 1598 is the year.

If you use Standard SQL (and we strongly recommend using only this dialect in BigQuery), you should use wildcards in WHERE to query it, as follows:

SELECT
  *
FROM
  `bigquery-public-data.samples.shakespeare.publication*`
WHERE
  _TABLE_SUFFIX BETWEEN ‘15980101’ AND ‘16000101’

The result will be all columns with data from all Shakespeare’s publications between 1598 and 1600.

2. Partitioned table: This is like the previous table but it is one table that is divided into segments containing all the data and pseudo columns in schema, called, for example, _PARTITIONTIME or _PARTITIONDATE, depending on the dividing method. Let’s say it is partitioned by date.

SELECT
  *
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  _PARTITIONTIME BETWEEN '1598-01-01' AND '1600-01-01'

WHERE

‘WHERE’ is the main operator you must use in order to have a table with only the data you need. Without this, you’ll either get the same table that you have queried or all data from columns that you chose in SELECT.

It’s not that hard to make conditions. Just write ‘column’ and the value(s) that you want to see in this column.

Simple Conditions

Simple conditions are the precise value you want to pick out of the table.

SELECT
  *
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'LVII'

More complex conditions

WHERE really likes the command BETWEEN. This is very useful when you need, for example, to show only those words that have a word_count of more than two but less than four. In other cases, it could be data between dates and timestamps etc.

SELECT
  *
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word_count BETWEEN 2 AND 3

AND / OR

SELECT * FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'meet' OR word = 'affords'	SELECT * FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'meet' AND corpus = 'sonnets'
‘OR' picks out all data that either have this or that value. In this case, it will be only those rows where column ‘word’ either contains ‘meet’ or ‘affords’.	‘AND’ takes only data that have both these records in a row. In this case, it will be only those rows, where column ‘word’ contains only ‘meet’ and simultaneously column ‘corpus’ contains only ‘sonnets’. Try this for yourself!

More and more conditions

You can combine a lot of conditions in WHERE, but remember that it must be readable for you and other users. Use parentheses.

SELECT
  *
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'meet'
  AND (corpus_date=1590 OR corpus_date=1594)
  AND NOT corpus IN ('sonnets', '1kinghenryvi', 'comedyoferrors')

Remember: if you have a zero result that seems odd, check the WHERE condition.

GROUP BY

From our point of view, this is one of the most complicated functions to explain. Firstly, it is used when you use aggregation functions, and secondly, it groups data. Let’s use SUM word_count of the word ‘QUEEN’ to see the total number of times this word appears in Shakespeare’s poetry.

SELECT
  word,
  word_count,
  corpus,
  corpus_date,
  SUM(word_count) AS sum_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'QUEEN'
GROUP BY
  word,
  word_count,
  corpus,
  corpus_date

Try it and see how it changes nothing. Why? Because every row is unique, and, of course, after GROUP BY, all of them are grouped: every row is a unique group – itself. Hence, grouping by QUEEN results in word_count = 76, corpus = 2kinghenryvi, corpus_date = 1590, and as a result, the sum_word_count = 76.

But what if we choose word and corpus_date in SELECT?

SELECT
  word,
  corpus_date,
  SUM(word_count) AS sum_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'QUEEN'
GROUP BY
  word,
  corpus_date

Look at corpus_date = 1590. Sum_word_count = 150. Now look at the first two rows in the result before that:

76 + 74 = 150

However, everything will be lost if we add word_count to the query column:

SELECT
  word,
  word_count,
  corpus_date,
  SUM(word_count) AS sum_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'QUEEN'
GROUP BY
  word,
  word_count,
  corpus_date

Look closely: rows 1 and 2 are unique because of different values in the word_counts column, so there cannot be any sum between these rows, which is why our aggregation does not work in the way we want. You see, to use aggregation, you should consider whether there will be repeated data in the columns that you want to group by. If one of these has mostly unrepeated data, your aggregation will be useless. So, in our case, we just need to take the ‘word’ column, as follows:

SELECT
  word,
  SUM(word_count) AS sum_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'QUEEN'
GROUP BY
  word

One more thing (albeit rare) is a situation where we could use no GROUP BY and just throw out the column ‘word’.

SELECT
  SUM(word_count) AS sum_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = 'QUEEN'

Data types in the column: STRING / INTEGER / BOOLEAN / TIMESTAMP

1. STRING: text data (which must be put in inverted commas in the query)
2. INTEGER: whole numbers (queried without inverted commas)
3. BOOLEAN: TRUE / FALSE or 1 / 0 (queried without inverted commas)
4. TIMESTAMP: date + time data type (written as TIMESTAMP)(‘2017-07-23’)

To change the data type in the column, use CAST or SAFE_CAST functions in SELECT:

SELECT
  CAST(word_count AS STRING) AS word_count
FROM
  `bigquery-public-data.samples.shakespeare`

Now column ‘word_count’ has a STRING data type instead of INTEGER.

TIMESTAMP AND DATE

1. ‘TIMESTAMP’: stores the date and time
2. ‘DATE’: stores the date only

There are plenty of methods for working with the formats that you can find in official documentation. Most of these are just about conversion from one date format to another.

We want to touch upon the issue where you want to have a date column. Here, it is preferable to either use TIMESTAMP with hours, minutes, seconds and even milliseconds or the INTEGER type, which stores the UNIX (UNIX Epoch) date. For example, for the date January 1, 2018, this could be 1514764801. What does this number mean? It is the number of seconds that have elapsed since January 1, 1970. If you see this kind of number but with three more digits (e.g., 1514764801000), it is the same but in milliseconds!

But is this preferable? The thing is that if you stream data from your website, you could have hits/events every second, but if you store it as Year-Month-Date, you don’t know whether the particular hit of 2018-01-01 was the first result of that day, the last or in the middle.

A helpful resource for testing conversion from seconds to the date (and vice versa) can be found at epochconverter.com.

January 1, 2018 in seconds:

January 1, 2018 in milliseconds:

VLOOKUP 2.0 or joined tables: LEFT-JOIN EACH / RIGHT-JOIN EACH

‘VLOOKUP’ in Excel is good, but if you need to join two tables with 700,000 rows each, it’ll take you a few hours. However, with JOIN in SQL, it takes about five minutes.

‘JOIN’ merges two tables into one using a common column. (Here, we are talking about the data in the table, not the column names.) Moreover, this could be a LEFT, RIGHT, CROSS or INNER JOIN. For example, if this is a LEFT JOIN, then you JOIN everything in the left-hand part of the table.

As an example, we will use some other public data, namely data relating to the San Francisco Bikeshare. One table has station id and a number of available bikes. The second one has a public name and station id too. We want to have a table that has station id, the public name of this and the number of available bikes.

Remember:

1. Tables must have columns with identical data. (In the example below, station_id is station_id in both tables.)
2. Columns must have an identical data type.

SELECT
  t1.station_id,
  t2.name,
  t1.num_bikes_available
FROM
 (SELECT
  station_id,
  num_bikes_available
FROM
   --HERE WE DEFINED WHICH TABLE WILL BE ON THE LEFT
   --ADD ALIAS T1 TO MAKE MANUPULATION WITH TABLE'S COLUMNS EASIER.
  `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_status`) AS t1
LEFT JOIN
 (SELECT
  station_id,
  name
FROM
   --HERE WE DEFINED WHICH TABLE WILL BE ON THE RIGHT
   --ADD ALIAS T2 TO MAKE MANUPULATION WITH TABLE'S COLUMNS EASIER.
  `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`) AS t2
   --COLUMNS, THAT HAVE IDENTICAL DATA. SOMETIMES, THEY MAY HAVE DIFFERENT NAMES, BUT ONLY DATA WITHIN MATTERS!
ON t1.station_id = t2.station_id

CASE WHEN... THEN... END

CASE: This is kind of magic among the functions of SQL, and there are two variants. Let’s take a look at the first one:

SELECT
  is_male,
FROM
  `bigquery-public-data.samples.natality`
LIMIT 100

This is nice, but we want ‘Yes’ or ‘No’ instead of ‘true’ and ‘false’.

SELECT
CASE
  WHEN is_male = TRUE THEN 'Yes'
  WHEN is_male = FALSE THEN 'No'
END
  AS is_male,
FROM
  `bigquery-public-data.samples.natality`
LIMIT
  100

Variant 2

This is good, but we are told to make separate columns for ‘Male’ and ‘Female’. We don’t know why, but o.k., let’s do it.

SELECT
CASE
  WHEN is_male = TRUE THEN 1 ELSE 0
END AS
  Male,
CASE
  WHEN is_male = FALSE THEN 1 ELSE 0
END AS
  Female,
FROM
  `bigquery-public-data.samples.natality`
LIMIT
  100

SUBQUERIES, WITH, PARSING, SUBSTRING, MERGING TWO COLUMNS IN ONE

SUBQUERIES AND ‘WITH’ STATEMENT are another magic feature of SQL. Remember: every query just makes another NEW table, which we can use to query data with.

This table has four columns: word, word_count, corpus and corpus_date. Here, we will just take ‘word’:

SELECT
  word
FROM
  `bigquery-public-data.samples.shakespeare`

We are free to query this for something. Let’s count the total number of words.

SELECT
  COUNT(word) AS total_number_of_words
FROM (
  SELECT
    word
  FROM
    `bigquery-public-data.samples.shakespeare`)

Please note that we query this table just like any other using SELECT FROM, and again we can subquery the result. Say we want to add 100,500 to total_number_of_words and name it ‘madness’.

SELECT
  total_number_of_words + 100500 AS madness
FROM (
  SELECT
    COUNT(word) AS total_number_of_words
  FROM (
    SELECT
      word
    FROM
      `bigquery-public-data.samples.shakespeare`))

The query looks stepped. One day, you may need to debug such a query but with many subqueries, so our advice here is to read it from its core. In our case, it is SELECT word FROM `bigquery-public-data.samples.shakespeare'.

Subqueries can also be used in SELECT and WHERE. In our example, we got a 1 x 1 table that is itself equal to the value 265,156, which could be used as a condition in WHERE, as follows:

SELECT
…
FROM
…
WHERE
Expression > (SELECT
  total_number_of_words + 100500 AS madness
FROM (
  SELECT
    COUNT(word) AS total_number_of_words
  FROM (
    SELECT
      word
    FROM
      `bigquery-public-data.samples.shakespeare`)))

Another tip is to avoid making such stepped queries altogether because as you can see, it is hard to read. Use WITH statement instead. It essentially does the same thing but is far more readable:

WITH
  Core AS (
  SELECT
    word
  FROM
    `bigquery-public-data.samples.shakespeare`),
  Count AS (
  SELECT
    COUNT(word) AS total_number_of_words
  FROM
    Core),
  Madness AS (
  SELECT
    total_number_of_words + 100500 AS mandess
  FROM
    Count)
SELECT
  *
FROM
  Madness

Each subquery is now made into a separate table (named ‘Core’, ‘Count’, ‘Madness’) within the query, which refers one to another. Our main and final query starts at 17. One thing to mention, though, is that WITH only works in Standard SQL.

Using these methods (the last one in particular), the data processing is done without multiplying tables and views.

SUBSTR

SUBSTR is a string method that cuts your string and gives you a new one based on it. We will cut the word ‘QUEEN’, with the aim of having only the first letter (‘Q’) as a result.

You will want to use it as a marketer if, for example, you want to get a specific part of a URL from website streaming.

Column Parsing

This will be a difficult but useful section. To be able to parse a column by a specific symbol (or delimiter like Text-to-Columns in Excel), we need to use the SPLIT and ARRAY method ‘OFFSET’, giving names to new columns. First of all, let’s look at the result of a query that gives us one column with date/time/time zone.

SELECT
  created_at
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT
  10

In this case, our delimiter will be the space symbol because there is a space between date and time, as well as between time and time zone. Our next step will involve using SPLIT:

SELECT
  SPLIT(created_at, ' ') AS full_date
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT
  10

As a result, we get rows with arrays in each. The first row is equal to [2012/03/30, 02:17:15, -0700].

Now we need to parse the array with OFFSET(n). The ‘n’ in parentheses corresponds to an element in the array (index of it), starting from zero.

SELECT
  full_date[OFFSET(0)] AS date,
  full_date[OFFSET(1)] AS time,
  full_date[OFFSET(2)] AS timezone
FROM
  (SELECT
    SPLIT(created_at, ' ') AS full_date
FROM
  `bigquery-public-data.samples.github_nested`
LIMIT
  10)

That’s all there is to it. Remember: if the index is out of range, for example, OFFSET(3), it will generate an error. A reasonable question to ask is why can’t we do it by SUBSTR? We could, and maybe in some cases it’s preferable, but it often lacks readability.

Merge columns

To merge columns, you need to use CONCAT, but this function only works with a STRING data type.

SELECT
  CONCAT(word, word_count, corpus, corpus_date)
FROM
  `bigquery-public-data.samples.shakespeare`
LIMIT
  10

Before	After

Make SQL do the calculations instead of you – analytical functions

SQL can calculate different things by using its analytical functions.

SELECT
  MIN(word_count) AS min_words,
  MAX(word_count) AS max_words,
  AVG(word_count) AS average_words
FROM
  `bigquery-public-data.samples.shakespeare`

Remember: if you want to calculate values for a specific column, you must use GROUP BY. Let’s use the example of the word ‘QUEEN’ again.

SELECT
  word,
  SUM(word_count) AS total_word_count
FROM
  `bigquery-public-data.samples.shakespeare`
WHERE
  word = "QUEEN"
GROUP BY
  word