March 26, 2020
An article by Sergey Matrosov
We strongly recommend that you write and run the queries in this guide because learning by doing is key to understanding something. Good luck!
P.S. Feel free to send us constructive feedback. We are keen to make this guide as good as possible.
Google Bigquery is a cloud-based data warehouse, where you can group collections of data from other databases like SalesForce and data from Google Ads etc. by means of Bigquery datasets and tables.
• A ‘project’ is just a collection of datasets and is made using the Google Cloud Platform. If you need to stream data from your service to BigQuery, you can do this by using the project’s credentials and API.
• The ‘dataset’ is a collection of tables, like a folder with files.
• ‘Table’: This presents data in columns and rows. You can query, manipulate and extract particular data from this.
• ‘View’: This is a virtual table. You can think of it almost as a saved query that gives you the real table as a result of running it.
Plenty of data are streamed to BigQuery from the different services that are integrated with it. Let’s assume that your business is E-commerce. You have a website and CRM. These are at least two services that could be connected to BQ. BQ itself can stream its data to different platforms such as Google Data Studio or Tableau and data-visualization tools.
1. BigQuery consists of projects (data from services).
2. Projects consist of datasets.
3. Datasets consist of tables and views (queries).
4. Tables consist of rows and columns (two-dimensional array).
What is going on in the table?
‘Schema’ gives you an understanding of columns and each data type. The most common types are as follows:
• STRING – represents text;
The ‘Details’ section has information about the table itself, as follows:
• Table ID – path (address) of the table
‘Syntax’ is simple and refers to how you want the results to be shown – from where, with which conditions, how to show/represent your data and how much. The machine itself will execute syntax in this order:
3. GROUP BY
6. ORDER BY
As you can see, this order is not equal to the order of the query that you write in the interface. It is done like this to make the query more readable for users and to help them maintain it. Just remember that the machine executes it a little bit differently.
The syntax for all SQL dialects has pretty much the same structure.
Querying two tables
If two tables have identical schema and data types in columns, you can query these two tables at the same time. The result will be data from each table, given sequentially. The data from table 1 will be presented first, and then the data from table 2 will be presented.
If you want to take something, you should have an address or path where it is stored, and the path is structured as follows: project.dataset.table.
Sometimes, a table can have nested tables, or it could be partitioned by date, by timestamp or by etc. These are the main results of data streaming in BigQuery.
1. Table with nested tables: For example, project ‘bigquery-public-data’ with dataset ‘samples’ has the table ‘Shakespeare’, which can store such tables as ‘publication15980101’ and ‘sample15980201’ etc., where 1598 is the year.
If you use Standard SQL (and we strongly recommend using only this dialect in BigQuery), you should use wildcards in WHERE to query it, as follows:
SELECT * FROM `bigquery-public-data.samples.shakespeare.publication*` WHERE _TABLE_SUFFIX BETWEEN ‘15980101’ AND ‘16000101’
The result will be all columns with data from all Shakespeare’s publications between 1598 and 1600.
2. Partitioned table: This is like the previous table but it is one table that is divided into segments containing all the data and pseudo columns in schema, called, for example, _PARTITIONTIME or _PARTITIONDATE, depending on the dividing method. Let’s say it is partitioned by date.
SELECT * FROM `bigquery-public-data.samples.shakespeare` WHERE _PARTITIONTIME BETWEEN '1598-01-01' AND '1600-01-01'
‘WHERE’ is the main operator you must use in order to have a table with only the data you need. Without this, you’ll either get the same table that you have queried or all data from columns that you chose in SELECT.
It’s not that hard to make conditions. Just write ‘column’ and the value(s) that you want to see in this column.
Simple conditions are the precise value you want to pick out of the table.
SELECT * FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'LVII'
More complex conditions
WHERE really likes the command BETWEEN. This is very useful when you need, for example, to show only those words that have a word_count of more than two but less than four. In other cases, it could be data between dates and timestamps etc.
SELECT * FROM `bigquery-public-data.samples.shakespeare` WHERE word_count BETWEEN 2 AND 3
‘OR' picks out all data that either have this or that value. In this case, it will be only those rows where column ‘word’ either contains ‘meet’ or ‘affords’.
‘AND’ takes only data that have both these records in a row. In this case, it will be only those rows, where column ‘word’ contains only ‘meet’ and simultaneously column ‘corpus’ contains only ‘sonnets’. Try this for yourself!
More and more conditions
You can combine a lot of conditions in WHERE, but remember that it must be readable for you and other users. Use parentheses.
SELECT * FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'meet' AND (corpus_date=1590 OR corpus_date=1594) AND NOT corpus IN ('sonnets', '1kinghenryvi', 'comedyoferrors')
Remember: if you have a zero result that seems odd, check the WHERE condition.
From our point of view, this is one of the most complicated functions to explain. Firstly, it is used when you use aggregation functions, and secondly, it groups data. Let’s use SUM word_count of the word ‘QUEEN’ to see the total number of times this word appears in Shakespeare’s poetry.
SELECT word, word_count, corpus, corpus_date, SUM(word_count) AS sum_word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'QUEEN' GROUP BY word, word_count, corpus, corpus_date
Try it and see how it changes nothing. Why? Because every row is unique, and, of course, after GROUP BY, all of them are grouped: every row is a unique group – itself. Hence, grouping by QUEEN results in word_count = 76, corpus = 2kinghenryvi, corpus_date = 1590, and as a result, the sum_word_count = 76.
But what if we choose word and corpus_date in SELECT?
SELECT word, corpus_date, SUM(word_count) AS sum_word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'QUEEN' GROUP BY word, corpus_date
Look at corpus_date = 1590. Sum_word_count = 150. Now look at the first two rows in the result before that:
76 + 74 = 150
However, everything will be lost if we add word_count to the query column:
SELECT word, word_count, corpus_date, SUM(word_count) AS sum_word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'QUEEN' GROUP BY word, word_count, corpus_date
Look closely: rows 1 and 2 are unique because of different values in the word_counts column, so there cannot be any sum between these rows, which is why our aggregation does not work in the way we want. You see, to use aggregation, you should consider whether there will be repeated data in the columns that you want to group by. If one of these has mostly unrepeated data, your aggregation will be useless. So, in our case, we just need to take the ‘word’ column, as follows:
SELECT word, SUM(word_count) AS sum_word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'QUEEN' GROUP BY word
One more thing (albeit rare) is a situation where we could use no GROUP BY and just throw out the column ‘word’.
SELECT SUM(word_count) AS sum_word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word = 'QUEEN'
Data types in the column: STRING / INTEGER / BOOLEAN / TIMESTAMP
1. STRING: text data (which must be put in inverted commas in the query)
2. INTEGER: whole numbers (queried without inverted commas)
3. BOOLEAN: TRUE / FALSE or 1 / 0 (queried without inverted commas)
4. TIMESTAMP: date + time data type (written as TIMESTAMP)(‘2017-07-23’)
To change the data type in the column, use CAST or SAFE_CAST functions in SELECT:
SELECT CAST(word_count AS STRING) AS word_count FROM `bigquery-public-data.samples.shakespeare`
Now column ‘word_count’ has a STRING data type instead of INTEGER.
1. ‘TIMESTAMP’: stores the date and time
2. ‘DATE’: stores the date only
There are plenty of methods for working with the formats that you can find in official documentation. Most of these are just about conversion from one date format to another.
We want to touch upon the issue where you want to have a date column. Here, it is preferable to either use TIMESTAMP with hours, minutes, seconds and even milliseconds or the INTEGER type, which stores the UNIX (UNIX Epoch) date. For example, for the date January 1, 2018, this could be 1514764801. What does this number mean? It is the number of seconds that have elapsed since January 1, 1970. If you see this kind of number but with three more digits (e.g., 1514764801000), it is the same but in milliseconds!
But is this preferable? The thing is that if you stream data from your website, you could have hits/events every second, but if you store it as Year-Month-Date, you don’t know whether the particular hit of 2018-01-01 was the first result of that day, the last or in the middle.
A helpful resource for testing conversion from seconds to the date (and vice versa) can be found at epochconverter.com.
January 1, 2018 in seconds:
January 1, 2018 in milliseconds:
‘VLOOKUP’ in Excel is good, but if you need to join two tables with 700,000 rows each, it’ll take you a few hours. However, with JOIN in SQL, it takes about five minutes.
‘JOIN’ merges two tables into one using a common column. (Here, we are talking about the data in the table, not the column names.) Moreover, this could be a LEFT, RIGHT, CROSS or INNER JOIN. For example, if this is a LEFT JOIN, then you JOIN everything in the left-hand part of the table.
As an example, we will use some other public data, namely data relating to the San Francisco Bikeshare. One table has station id and a number of available bikes. The second one has a public name and station id too. We want to have a table that has station id, the public name of this and the number of available bikes.
1. Tables must have columns with identical data. (In the example below, station_id is station_id in both tables.)
2. Columns must have an identical data type.
SELECT t1.station_id, t2.name, t1.num_bikes_available FROM (SELECT station_id, num_bikes_available FROM --HERE WE DEFINED WHICH TABLE WILL BE ON THE LEFT --ADD ALIAS T1 TO MAKE MANUPULATION WITH TABLE'S COLUMNS EASIER. `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_status`) AS t1 LEFT JOIN (SELECT station_id, name FROM --HERE WE DEFINED WHICH TABLE WILL BE ON THE RIGHT --ADD ALIAS T2 TO MAKE MANUPULATION WITH TABLE'S COLUMNS EASIER. `bigquery-public-data.san_francisco_bikeshare.bikeshare_station_info`) AS t2 --COLUMNS, THAT HAVE IDENTICAL DATA. SOMETIMES, THEY MAY HAVE DIFFERENT NAMES, BUT ONLY DATA WITHIN MATTERS! ON t1.station_id = t2.station_id
CASE: This is kind of magic among the functions of SQL, and there are two variants. Let’s take a look at the first one:
SELECT is_male, FROM `bigquery-public-data.samples.natality` LIMIT 100
This is nice, but we want ‘Yes’ or ‘No’ instead of ‘true’ and ‘false’.
SELECT CASE WHEN is_male = TRUE THEN 'Yes' WHEN is_male = FALSE THEN 'No' END AS is_male, FROM `bigquery-public-data.samples.natality` LIMIT 100
This is good, but we are told to make separate columns for ‘Male’ and ‘Female’. We don’t know why, but o.k., let’s do it.
SELECT CASE WHEN is_male = TRUE THEN 1 ELSE 0 END AS Male, CASE WHEN is_male = FALSE THEN 1 ELSE 0 END AS Female, FROM `bigquery-public-data.samples.natality` LIMIT 100
SUBQUERIES AND ‘WITH’ STATEMENT are another magic feature of SQL. Remember: every query just makes another NEW table, which we can use to query data with.
This table has four columns: word, word_count, corpus and corpus_date. Here, we will just take ‘word’:
SELECT word FROM `bigquery-public-data.samples.shakespeare`
We are free to query this for something. Let’s count the total number of words.
SELECT COUNT(word) AS total_number_of_words FROM ( SELECT word FROM `bigquery-public-data.samples.shakespeare`)
Please note that we query this table just like any other using SELECT FROM, and again we can subquery the result. Say we want to add 100,500 to total_number_of_words and name it ‘madness’.
SELECT total_number_of_words + 100500 AS madness FROM ( SELECT COUNT(word) AS total_number_of_words FROM ( SELECT word FROM `bigquery-public-data.samples.shakespeare`))
The query looks stepped. One day, you may need to debug such a query but with many subqueries, so our advice here is to read it from its core. In our case, it is SELECT word FROM `bigquery-public-data.samples.shakespeare'.
Subqueries can also be used in SELECT and WHERE. In our example, we got a 1 x 1 table that is itself equal to the value 265,156, which could be used as a condition in WHERE, as follows:
SELECT … FROM … WHERE Expression > (SELECT total_number_of_words + 100500 AS madness FROM ( SELECT COUNT(word) AS total_number_of_words FROM ( SELECT word FROM `bigquery-public-data.samples.shakespeare`)))
Another tip is to avoid making such stepped queries altogether because as you can see, it is hard to read. Use WITH statement instead. It essentially does the same thing but is far more readable:
WITH Core AS ( SELECT word FROM `bigquery-public-data.samples.shakespeare`), Count AS ( SELECT COUNT(word) AS total_number_of_words FROM Core), Madness AS ( SELECT total_number_of_words + 100500 AS mandess FROM Count) SELECT * FROM Madness
Each subquery is now made into a separate table (named ‘Core’, ‘Count’, ‘Madness’) within the query, which refers one to another. Our main and final query starts at 17. One thing to mention, though, is that WITH only works in Standard SQL.
Using these methods (the last one in particular), the data processing is done without multiplying tables and views.
SUBSTR is a string method that cuts your string and gives you a new one based on it. We will cut the word ‘QUEEN’, with the aim of having only the first letter (‘Q’) as a result.
You will want to use it as a marketer if, for example, you want to get a specific part of a URL from website streaming.
This will be a difficult but useful section. To be able to parse a column by a specific symbol (or delimiter like Text-to-Columns in Excel), we need to use the SPLIT and ARRAY method ‘OFFSET’, giving names to new columns. First of all, let’s look at the result of a query that gives us one column with date/time/time zone.
SELECT created_at FROM `bigquery-public-data.samples.github_nested` LIMIT 10
In this case, our delimiter will be the space symbol because there is a space between date and time, as well as between time and time zone. Our next step will involve using SPLIT:
SELECT SPLIT(created_at, ' ') AS full_date FROM `bigquery-public-data.samples.github_nested` LIMIT 10
As a result, we get rows with arrays in each. The first row is equal to [2012/03/30, 02:17:15, -0700].
Now we need to parse the array with OFFSET(n). The ‘n’ in parentheses corresponds to an element in the array (index of it), starting from zero.
SELECT full_date[OFFSET(0)] AS date, full_date[OFFSET(1)] AS time, full_date[OFFSET(2)] AS timezone FROM (SELECT SPLIT(created_at, ' ') AS full_date FROM `bigquery-public-data.samples.github_nested` LIMIT 10)
That’s all there is to it. Remember: if the index is out of range, for example, OFFSET(3), it will generate an error. A reasonable question to ask is why can’t we do it by SUBSTR? We could, and maybe in some cases it’s preferable, but it often lacks readability.
To merge columns, you need to use CONCAT, but this function only works with a STRING data type.
SELECT CONCAT(word, word_count, corpus, corpus_date) FROM `bigquery-public-data.samples.shakespeare` LIMIT 10
SQL can calculate different things by using its analytical functions.
SELECT MIN(word_count) AS min_words, MAX(word_count) AS max_words, AVG(word_count) AS average_words FROM `bigquery-public-data.samples.shakespeare`
Remember: if you want to calculate values for a specific column, you must use GROUP BY. Let’s use the example of the word ‘QUEEN’ again.
SELECT word, SUM(word_count) AS total_word_count FROM `bigquery-public-data.samples.shakespeare` WHERE word = "QUEEN" GROUP BY word