The Change Data Capture from MySQL to BigQuery using Debezium and Pub/Sub template is a streaming pipeline that reads Pub/Sub messages with change data from a MySQL database and writes the records to BigQuery. A Debezium connector captures changes to the MySQL database and publishes the changed data to Pub/Sub. The template then reads the Pub/Sub messages and writes them to BigQuery.
You can use this template to sync MySQL databases and BigQuery tables. The pipeline writes the changed data to a BigQuery staging table and periodically updates a BigQuery table that replicates the MySQL database.
Pipeline requirements
- The Debezium connector must be deployed.
- The Pub/Sub messages must be serialized in a Beam Row.
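The Pub/Sub topic that the Debezium connector publishes to, and the subscription that the template reads from, must also exist before the pipeline starts. As a sketch, you could create them with the gcloud CLI; the topic and subscription names below are placeholders, not values required by the template:

    # Topic the Debezium connector publishes change events to (placeholder name)
    gcloud pubsub topics create mysql-cdc-topic

    # Subscription the Dataflow template reads from (placeholder name)
    gcloud pubsub subscriptions create mysql-cdc-sub \
        --topic=mysql-cdc-topic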
Template parameters
Required parameters
- inputSubscriptions: The comma-separated list of Pub/Sub input subscriptions to read from, in the format <SUBSCRIPTION_NAME>,<SUBSCRIPTION_NAME>, ....
- changeLogDataset: The BigQuery dataset to store the staging tables in, in the format <DATASET_NAME>.
- replicaDataset: The BigQuery dataset to store the replica tables in, in the format <DATASET_NAME>.
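Taken together, the required parameters might look like the following on the command line. The subscription and dataset names here are illustrative placeholders:

    --inputSubscriptions=mysql-cdc-sub-1,mysql-cdc-sub-2 \
    --changeLogDataset=cdc_changelog \
    --replicaDataset=cdc_replica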
Optional parameters
- inputTopics: Comma-separated list of Pub/Sub topics where CDC data is being pushed.
- updateFrequencySecs: The interval, in seconds, at which the pipeline updates the BigQuery table that replicates the MySQL database.
- useSingleTopic: Set this to true if you configure your Debezium connector to publish all table updates to a single topic. Defaults to: false.
- useStorageWriteApi: If true, the pipeline uses the BigQuery Storage Write API (https://cloud.google.com/bigquery/docs/write-api). The default value is false. For more information, see Using the Storage Write API (https://beam.apache.org/documentation/io/built-in/google-bigquery/#storage-write-api).
- useStorageWriteApiAtLeastOnce: When using the Storage Write API, specifies the write semantics. To use at-least-once semantics (https://beam.apache.org/documentation/io/built-in/google-bigquery/#at-least-once-semantics), set this parameter to true. To use exactly-once semantics, set the parameter to false. This parameter applies only when useStorageWriteApi is true. The default value is false.
- numStorageWriteApiStreams: When using the Storage Write API, specifies the number of write streams. If useStorageWriteApi is true and useStorageWriteApiAtLeastOnce is false, then you must set this parameter. Defaults to: 0.
- storageWriteApiTriggeringFrequencySec: When using the Storage Write API, specifies the triggering frequency, in seconds. If useStorageWriteApi is true and useStorageWriteApiAtLeastOnce is false, then you must set this parameter.
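For example, to opt in to the Storage Write API with exactly-once semantics, you would set the following options together. The stream count and triggering frequency shown are illustrative values, not recommendations:

    --useStorageWriteApi=true \
    --useStorageWriteApiAtLeastOnce=false \
    --numStorageWriteApiStreams=3 \
    --storageWriteApiTriggeringFrequencySec=10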
Run the template
To run this template, perform the following steps:
- On your local machine, clone the DataflowTemplates repository (https://github.com/GoogleCloudPlatform/DataflowTemplates).
- Change to the v2/cdc-parent directory.
- Ensure that the Debezium connector is deployed.
- Using Maven, run the Dataflow template:

    mvn exec:java -pl cdc-change-applier -Dexec.args="--runner=DataflowRunner \
        --inputSubscriptions=SUBSCRIPTIONS \
        --updateFrequencySecs=300 \
        --changeLogDataset=CHANGELOG_DATASET \
        --replicaDataset=REPLICA_DATASET \
        --project=PROJECT_ID \
        --region=REGION_NAME"

  Replace the following:
- PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
- SUBSCRIPTIONS: your comma-separated list of Pub/Sub subscription names
- CHANGELOG_DATASET: your BigQuery dataset for changelog data
- REPLICA_DATASET: your BigQuery dataset for replica tables
- REGION_NAME: the region where you want to deploy your Dataflow job, for example us-central1
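If the changelog and replica datasets don't already exist in your project, you can create them first, for example with the bq CLI (using the same placeholders as above):

    # Create the staging (changelog) and replica datasets
    bq mk --dataset PROJECT_ID:CHANGELOG_DATASET
    bq mk --dataset PROJECT_ID:REPLICA_DATASET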
 
What's next
- Learn about Dataflow templates.
- See the list of Google-provided templates.