BigQuery destination integration documentation
The BigQuery plugin syncs data from any CloudQuery source plugin(s) to a BigQuery database running on Google Cloud Platform.
The following example configures the BigQuery destination plugin for a sync:

```yaml
kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v4.6.0"
  write_mode: "append"
  send_sync_summary: true
  # Learn more about the configuration options at https://cql.ink/bigquery_destination
  spec:
    project_id: ${PROJECT_ID}
    dataset_id: ${DATASET_ID}
    # Optional parameters
    # dataset_location: ""
    # time_partitioning: none # options: "none", "hour", "day", "month", "year"
    # time_partitioning_expiration: 0 # duration, e.g. "24h" or "720h" (30 days)
    # service_account_key_json: ""
    # endpoint: ""
    # batch_size: 10000
    # batch_size_bytes: 5242880 # 5 MiB
    # batch_timeout: 10s
    # client_project_id: "*detect-project-id*"
```

- PROJECT_ID - the Google Cloud project ID
- DATASET_ID - the Google Cloud BigQuery dataset ID

The client_project_id option can be used to run BigQuery queries in a project different from the one where the destination table is located. If you set client_project_id to *detect-project-id*, the project ID is detected automatically from the environment variable or application default credentials.

Batching is controlled by the batch_size and batch_size_bytes options. Note that the BigQuery plugin only supports the append write mode.

Authentication options:

- gcloud auth application-default login (recommended when running locally)
- A service account key file referenced by the GOOGLE_APPLICATION_CREDENTIALS environment variable (not recommended, as long-lived keys are a security risk)

BigQuery spec reference:

- project_id (string) (required)
- dataset_id (string) (required) - the name of the BigQuery dataset, e.g. my_dataset. This dataset needs to be created before running a sync or migration.
- dataset_location (string) (optional)
- time_partitioning (string) (options: none, hour, day) (default: none) - the partition column used is always _cq_sync_time, so all rows for a sync run are partitioned on the hour/day the sync started.
- time_partitioning_expiration (duration) (optional) - e.g. 3600s, 60m, 24h, 720h. This option is only valid if time_partitioning is set to a value other than none.
- service_account_key_json (string) (optional) (default: empty)
- endpoint (string) (optional)
- batch_size (integer) (optional) (default: 10000)
- batch_size_bytes (integer) (optional) (default: 5242880, i.e. 5 MiB)
- batch_timeout (duration) (optional) (default: 10s)
- text_embeddings (object) (optional) - the remote model to use is specified by the remote_model_name field.
  - remote_model_name (string) (required)
  - tables (array) (required)
    - source_table_name (string) (required)
    - target_table_name (string) (required)
    - embed_columns (array) (required)
    - metadata_columns (array) (optional) - _cq_id is always included.
    - text_splitter (object) (optional) - only recursive_text is supported.
      - recursive_text (object) (required)
        - chunk_size (integer) (required) (e.g. 1000)
        - chunk_overlap (integer) (required) (e.g. 100)

The BigQuery destination (v3.0.0 and later) supports most Apache Arrow types. The following table shows the supported types and how they are mapped to BigQuery data types. The destination uses REPEATED columns to represent lists; lists of lists are not supported right now.
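As an illustration of the partitioning options described above, a destination spec that enables daily partitions with a 30-day expiration might look like the sketch below. The project and dataset values are placeholders, and the expiration value is only an example, not a recommendation.

```yaml
kind: destination
spec:
  name: bigquery
  path: cloudquery/bigquery
  registry: cloudquery
  version: "v4.6.0"
  write_mode: "append"
  spec:
    project_id: my-project              # placeholder project ID
    dataset_id: my_dataset              # must be created before the sync runs
    time_partitioning: day              # partitions rows on _cq_sync_time by day
    time_partitioning_expiration: 720h  # example: expire partitions after 30 days
```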
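The nested text_embeddings fields listed above could be combined as in the following sketch. The model, table, and column names are hypothetical and only illustrate the shape implied by the field list.

```yaml
spec:
  # ...project_id, dataset_id, and other options as above...
  text_embeddings:
    remote_model_name: my_dataset.my_embedding_model  # hypothetical remote model
    tables:
      - source_table_name: articles                   # hypothetical source table
        target_table_name: articles_embeddings        # hypothetical target table
        embed_columns: ["title", "body"]              # columns whose text is embedded
        metadata_columns: ["url"]                     # _cq_id is always included
        text_splitter:
          recursive_text:
            chunk_size: 1000
            chunk_overlap: 100
```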