Datastream to MySQL or PostgreSQL (Stream) template

The Datastream to SQL template is a streaming pipeline that reads Datastream data and replicates it into any MySQL or PostgreSQL database. The template reads data from Cloud Storage using Pub/Sub notifications and replicates this data into SQL replica tables. Specify either the gcsPubSubSubscription parameter to read data from Pub/Sub notifications, OR provide the inputFilePattern parameter to directly read data from files in Cloud Storage.

The template does not support data definition language (DDL) and expects that all tables already exist in the database. Replication uses Dataflow stateful transforms to filter stale data and ensure consistency in out of order data. For example, if a more recent version of a row has already passed through, a late arriving version of that row is ignored. The data manipulation language (DML) that executes is a best attempt to perfectly replicate source to target data. The DML statements executed follow the following rules:

If a primary key exists, insert and update operations use upsert syntax (ie. INSERT INTO table VALUES (...) ON CONFLICT (...) DO UPDATE).
If primary keys exist, deletes are replicated as a delete DML.
If no primary key exists, both insert and update operations are inserted into the table.
If no primary keys exist, deletes are ignored.

If you are using the Oracle to Postgres utilities, add ROWID in SQL as the primary key when none exists.

Pipeline requirements

A Datastream stream that is ready to or already replicating data.
Cloud Storage Pub/Sub notifications are enabled for the Datastream data.
A PostgreSQL database was seeded with the required schema.
Network access between Dataflow workers and PostgreSQL is set up.

Template parameters

Required parameters

inputFilePattern: The file location for the Datastream files in Cloud Storage to replicate. This file location is typically the root path for the stream.
databaseHost: The SQL host to connect on.
databaseUser: The SQL user with all required permissions to write to all tables in replication.
databasePassword: The password for the SQL user.

Optional parameters

gcsPubSubSubscription: The Pub/Sub subscription with Datastream file notifications. For example, projects/<PROJECT_ID>/subscriptions/<SUBSCRIPTION_ID>.
inputFileFormat: The format of the output file produced by Datastream. For example, avro or json. Defaults to avro.
streamName: The name or template for the stream to poll for schema information. The default value is {_metadata_stream}.
rfcStartDateTime: The starting DateTime used to fetch from Cloud Storage (https://tools.ietf.org/html/rfc3339). Defaults to: 1970-01-01T00:00:00.00Z.
dataStreamRootUrl: Datastream API Root URL. Defaults to: https://datastream.googleapis.com/.
databaseType: The database type to write to (for example, Postgres). Defaults to: postgres.
databasePort: The SQL database port to connect to. The default value is 5432.
databaseName: The name of the SQL database to connect to. The default value is postgres.
defaultCasing: A Toggle for table casing behavior. For example,(ie.LOWERCASE = mytable -> mytable, UPPERCASE = mytable -> MYTABLECAMEL = my_table -> myTable, SNAKE = myTable -> my_table. Defaults to: LOWERCASE.
columnCasing: A toggle for target column name casing. LOWERCASE (default): my_column -> my_column. UPPERCASE: my_column -> MY_COLUMN. CAMEL: my_column -> myColumn. SNAKE: myColumn -> my_column.
schemaMap: A map of key/values used to dictate schema name changes (ie. old_name:new_name,CaseError:case_error). Defaults to empty.
customConnectionString: Optional connection string which will be used instead of the default database string.
numThreads: Determines key parallelism of Format to DML step, specifically, the value is passed into Reshuffle.withNumBuckets. Defaults to: 100.
databaseLoginTimeout: The timeout in seconds for database login attempts. This helps prevent connection hangs when multiple workers try to connect simultaneously.
datastreamSourceType: Override the source type detection for Datastream CDC data. When specified, this value will be used instead of deriving the source type from the read_method field. Valid values include 'mysql', 'postgresql', 'oracle', etc. This parameter is useful when the read_method field contains 'cdc' and the actual source type cannot be determined automatically.
orderByIncludesIsDeleted: Order by configurations for data should include prioritizing data which is not deleted. Defaults to: false.

Run the template

Console

Go to the Dataflow Create job from template page.

Go to Create job from template

In the Job name field, enter a unique job name.
Optional: For Regional endpoint, select a value from the drop-down menu. The default region is us-central1.
For a list of regions where you can run a Dataflow job, see Dataflow locations.
From the Dataflow template drop-down menu, select the Cloud Datastream to SQL template.
In the provided parameter fields, enter your parameter values.
Click Run job.

gcloud

In your shell or terminal, run the template:

gcloud dataflow flex-template run JOB_NAME \  --project=PROJECT_ID \  --region=REGION_NAME \  --enable-streaming-engine \  --template-file-gcs-location=gs://dataflow-templates-REGION_NAME/VERSION/flex/Cloud_Datastream_to_SQL \  --parameters \ inputFilePattern=GCS_FILE_PATH,\ gcsPubSubSubscription=GCS_SUBSCRIPTION_NAME,\ databaseHost=DATABASE_HOST,\ databaseUser=DATABASE_USER,\ databasePassword=DATABASE_PASSWORD

Replace the following:

PROJECT_ID: the Google Cloud project ID where you want to run the Dataflow job
JOB_NAME: a unique job name of your choice
REGION_NAME: the region where you want to deploy your Dataflow job—for example, us-central1
VERSION: the version of the template that you want to use You can use the following values: latest to use the latest version of the template, which is available in the non-dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/latest/ the version name, like 2023-09-12-00_RC00, to use a specific version of the template, which can be found nested in the respective dated parent folder in the bucket— gs://dataflow-templates-REGION_NAME/ Caution: The latest version of templates might update with breaking changes. Your production environments should use templates kept in the most recent dated parent folder to prevent these breaking changes from affecting your production workflows.
GCS_FILE_PATH: the Cloud Storage path to Datastream data. For example: gs://bucket/path/to/data/
GCS_SUBSCRIPTION_NAME: the Pub/Sub subscription to read changed files from. For example: projects/my-project-id/subscriptions/my-subscription-id.
DATABASE_HOST: your SQL host IP.
DATABASE_USER: your SQL user.
DATABASE_PASSWORD: your SQL password.

API

To run the template using the REST API, send an HTTP POST request. For more information on the API and its authorization scopes, see projects.templates.launch.

POST https://dataflow.googleapis.com/v1b3/projects/PROJECT_ID/locations/LOCATION/flexTemplates:launch {  "launch_parameter": {  "jobName": "JOB_NAME",  "parameters": {  "inputFilePattern": "GCS_FILE_PATH",  "gcsPubSubSubscription": "GCS_SUBSCRIPTION_NAME",  "databaseHost": "DATABASE_HOST",  "databaseUser": "DATABASE_USER",  "databasePassword": "DATABASE_PASSWORD"  },  "containerSpecGcsPath": "gs://dataflow-templates-LOCATION/VERSION/flex/Cloud_Datastream_to_SQL",  } }