Commit e19e189

Add documentation for flattening BigQuery table. (googlegenomics#279)
* Add documentation for flattening BigQuery table.
1 parent f88cda5 commit e19e189

11 files changed: 210 additions and 0 deletions

README.md

Lines changed: 1 addition & 0 deletions
@@ -185,6 +185,7 @@ region in support of your project’s security and compliance needs. See
 * [Appending data to existing tables](docs/data_append.md)
 * [Variant Annotation](docs/variant_annotation.md)
 * [Partitioning](docs/partitioning.md)
+* [Flattening the BigQuery table](docs/flattening_table.md)
 * [Troubleshooting](docs/troubleshooting.md)
 
 ## Development

docs/flattening_table.md

Lines changed: 209 additions & 0 deletions
@@ -0,0 +1,209 @@
# Flattening the BigQuery table

Querying multiple independently repeated fields or calculating the cross product
of such fields requires "flattening" the BigQuery records. You may have seen
error messages like `"Cannot query the cross product of repeated fields ..."`
from BigQuery in such scenarios. This page describes workarounds for
enabling such queries and for exporting a flattened BigQuery table that can be
used directly in tools that require a flattened table structure (e.g. for
easier data visualization).

Please note that the instructions on this page are for
[Standard SQL](https://cloud.google.com/bigquery/docs/reference/standard-sql/)
and not
[Legacy SQL](https://cloud.google.com/bigquery/docs/reference/legacy-sql).

## Flattening basics

Consider the following BigQuery row:

![Flatten original row](images/flatten_original_row.png)

It contains two alternate bases (`C` and `T`) and two calls (`NA12890`
and `NA12878`).

To get a table that contains one call per row, you need to explicitly flatten
the table on the repeated call record as follows:

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  `project.dataset.table` AS t,
  t.call AS call
```

![Flatten call names](images/flatten_call_names.png)

Note that BigQuery throws the error
`"Cannot access field name on a value with type ARRAY<STRUCT<name ..."` if you
do not include the additional `t.call AS call` item in the `FROM` clause.
Please see
[this page](https://cloud.google.com/bigquery/docs/reference/standard-sql/migrating-from-legacy-sql#removing_repetition_with_flatten)
for more details. Also, note that explicitly using `UNNEST` is not necessary
because a fully-qualified path is used, but you may also use
`UNNEST(call) AS call` instead of `t.call AS call`. Please see
[here](https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#field_path)
for more details.

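The explicit `UNNEST` form can be tried without access to a real table by inlining a sample row. The following is only a sketch: the `variants` CTE is a hypothetical stand-in for `project.dataset.table`, and its coordinates and reference base are made-up values; only the two call names come from the example row above.

```
#standardSQL
-- Hypothetical one-row stand-in for `project.dataset.table`.
WITH variants AS (
  SELECT
    'chr1' AS reference_name,
    1000 AS start_position,
    1001 AS end_position,
    'A' AS reference_bases,
    [STRUCT('NA12890' AS name), STRUCT('NA12878' AS name)] AS call
)
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name
FROM
  variants AS t,
  UNNEST(t.call) AS call  -- explicit form of `t.call AS call`
```

Under these assumptions the query should return one row per call, i.e. two rows for the sample record.
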
You can include additional information for each call by adding the
corresponding fields to the `SELECT` clause. For instance, the following query
adds the call genotypes (as an array of integers) to the result.

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, call.genotype
FROM
  `project.dataset.table` AS t,
  t.call AS call
```

![Flatten call orig genotype](images/flatten_call_orig_genotype.png)

To further flatten the BigQuery table on the genotype array (i.e. have one
genotype per row), you can add another explicit join with `call.genotype` as
follows:

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype
FROM
  `project.dataset.table` AS t,
  t.call AS call,
  call.genotype AS genotype
```

![Flatten call, genotype](images/flatten_call_flatten_genotype.png)

Note that in this case, the call names are duplicated as each call contains
two genotype values.

Let's add `alternate_bases`, which is an independently repeated record, to the
`SELECT` clause:

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype, alternate_bases
FROM
  `project.dataset.table` AS t,
  t.call AS call,
  call.genotype AS genotype
```

![Flatten call, genotype orig alt](images/flatten_call_flatten_genotype_orig_alt.png)

This result looks odd as every row contains both alternate bases even though
we have flattened the `genotype` column. We really only want to return the
particular alternate base that matches the index in the `genotype` column. This
can be done using `ORDINAL` as follows:

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name, genotype,
  IF(genotype > 0, alternate_bases[ORDINAL(genotype)], NULL) AS alternate_bases
FROM
  `project.dataset.table` AS t,
  t.call AS call,
  call.genotype AS genotype
```

![Flatten call, genotype, alt](images/flatten_call_flatten_genotype_flatten_alt.png)

Note that the semantics of the `genotype` column have changed, as each row now
contains only a single alternate allele. As a result, you may decide to
reformat that column using
`IF(genotype > 0, 1, genotype) AS alt_genotype`, which results in:
* `0` implying reference match.
* `1` implying match to the particular alternate specified in the row.
* `-1` implying not called. Note that Variant Transforms uses `-1` to denote
  genotypes that are not called (i.e. `.` in the VCF file).

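The remapping can be checked in isolation on a few literal values. This is a minimal sketch: the array `[-1, 0, 1, 2]` is an illustrative sample covering a no-call, a reference match, and the first and second alternates, not data from a real table.

```
#standardSQL
-- Sample genotype values: not called, reference, first and second alternate.
SELECT
  genotype,
  IF(genotype > 0, 1, genotype) AS alt_genotype
FROM
  UNNEST([-1, 0, 1, 2]) AS genotype
ORDER BY
  genotype
```

Both `1` and `2` collapse to `1`, while `0` and `-1` pass through unchanged.
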
Finally, to only include the `alternate_bases.alt` column, you need to
explicitly flatten on the `alternate_bases` record as well and use the index as
a filtering criterion as follows:

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  IF(genotype > 0, alts.alt, NULL) AS alt
FROM
  `project.dataset.table` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)
```

![Flatten call, genotype, only alt](images/flatten_call_flatten_genotype_only_alt.png)

Please note the explicit `LEFT JOIN` clause in this case, as we also want to
include any record that does not have an alternate base. You may choose to use
`INNER JOIN` (or simply include
`t.alternate_bases AS alts WITH OFFSET AS a_index` in the `FROM` clause) to
only include records that have at least one alternate base.

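To see the effect of `LEFT JOIN` on records without alternate bases, the query can be run against inlined sample rows. This is a sketch under stated assumptions: the `variants` CTE, the positions `1000` and `2000`, and the genotype values are all hypothetical; the second row deliberately has an empty `alternate_bases` array.

```
#standardSQL
-- Hypothetical rows: the first has one alternate base, the second has none.
WITH variants AS (
  SELECT
    1000 AS start_position,
    [STRUCT('C' AS alt)] AS alternate_bases,
    [STRUCT('NA12878' AS name, [0, 1] AS genotype)] AS call
  UNION ALL
  SELECT
    2000,
    ARRAY<STRUCT<alt STRING>>[],
    [STRUCT('NA12878' AS name, [0, 0] AS genotype)]
)
SELECT
  start_position,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  IF(genotype > 0, alts.alt, NULL) AS alt
FROM
  variants AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)
```

With `LEFT JOIN`, the record at position `2000` should still appear (with a `NULL` value for `alt`); moving `t.alternate_bases AS alts WITH OFFSET AS a_index` into the `FROM` clause instead should drop it.
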
## Example query for flattening BigQuery table

With the background above, you can flatten the BigQuery table so that it does
not contain any repeated records using the query template shown below. Note
that there are some semantic changes: the actual genotype value no longer
corresponds to the index of the alternate base, so it is set to `1`, `0` or
`-1` if it matches the alternate base, matches the reference, or is not set,
respectively.

```
#standardSQL
SELECT
  reference_name, start_position, end_position, reference_bases,
  IF(genotype > 0, alts.alt, NULL) AS alt,
  ARRAY_TO_STRING(t.names, ' ') AS names,
  t.quality,
  ARRAY_TO_STRING(t.filter, ' ') AS filter,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype,
  call.phaseset
FROM
  `project.dataset.table` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)
```

For other repeated fields, you may choose to either concatenate them as a single
field (i.e. use `ARRAY_TO_STRING`) or add them to the `FROM` or `LEFT JOIN`
clause to explicitly flatten on those fields as well.

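As a quick illustration of the concatenation option, `ARRAY_TO_STRING` can be applied to literal arrays. The values below are made up for the example and merely stand in for repeated string fields such as `names` and `filter`; the `;` delimiter is an arbitrary choice.

```
#standardSQL
-- Made-up sample values standing in for repeated string fields.
SELECT
  ARRAY_TO_STRING(['rs123', 'rs456'], ' ') AS names,
  ARRAY_TO_STRING(['q10', 's50'], ';') AS filter
```

Each array collapses into a single string, so the result contains no repeated fields. Pick a delimiter that cannot occur inside the values if you need to split the string again later.
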
You may materialize the result of this query into a new table by following the
instructions
[here](https://cloud.google.com/bigquery/docs/tables#creating_a_table_from_a_query_result).

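One way to do this directly in SQL is a `CREATE TABLE ... AS SELECT` statement. The sketch below uses a shortened version of the template query; the destination name `project.dataset.flattened_table` is a placeholder, and the statement assumes you have permission to create tables in that dataset.

```
#standardSQL
-- `project.dataset.flattened_table` is a placeholder name for the new table.
CREATE TABLE `project.dataset.flattened_table` AS
SELECT
  reference_name, start_position, end_position, reference_bases,
  IF(genotype > 0, alts.alt, NULL) AS alt,
  call.name AS call_name,
  IF(genotype > 0, 1, genotype) AS alt_genotype
FROM
  `project.dataset.table` AS t,
  t.call AS call,
  call.genotype AS genotype
LEFT JOIN
  t.alternate_bases AS alts WITH OFFSET AS a_index
WHERE
  genotype IN (a_index + 1, 0, -1)
```

Setting a destination table in the query settings, as described in the linked instructions, is an alternative that leaves the query text unchanged.
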
### Example result

Running the above query on the following table:

![Flatten original full example](images/flatten_original_full_example.png)

Produces the following output:

![Flattened full example](images/flattened_full_example.png)