# Data Pipeline

## Definition

Since events in Flink CDC flow from the upstream to the downstream in a pipeline manner, the whole ETL task is referred to as a Data Pipeline.

## Parameters

A pipeline corresponds to a chain of operators in Flink.
To describe a Data Pipeline, the following parts are required:

- source
- sink
- pipeline

and the following parts are optional:

- route
- transform

## Example

### Only required

We could use the following YAML file to define a concise Data Pipeline that synchronizes all tables under the MySQL `app_db` database to Doris:

```yaml
pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2

source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
```

### With optional

We could use the following YAML file to define a more complete Data Pipeline that synchronizes all tables under the MySQL `app_db` database to Doris, writing to a specific target database `ods_db` with the target table name prefix `ods_`:

```yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

transform:
  - source-table: adb.web_order01
    projection: \*, format('%S', product_name) as product_name
    filter: addone(id) > 10 AND order_id > 100
    description: project fields and filter
  - source-table: adb.web_order02
    projection: \*, format('%S', product_name) as product_name
    filter: addone(id) > 20 AND order_id > 200
    description: project fields and filter

route:
  - source-table: app_db.orders
    sink-table: ods_db.ods_orders
  - source-table: app_db.shipments
    sink-table: ods_db.ods_shipments
  - source-table: app_db.products
    sink-table: ods_db.ods_products

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
  user-defined-function:
    - name: addone
      classpath: com.example.functions.AddOneFunctionClass
    - name: format
      classpath: com.example.functions.FormatFunctionClass
```

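The `user-defined-function` entries above point at ordinary Java classes on the pipeline's classpath. As a sketch of what such a class might look like (the class below is hypothetical, matching the `addone` name used in the filter expressions; it assumes Flink CDC's convention of resolving pipeline UDFs through a public `eval` method):

```java
// Hypothetical sketch of the addone UDF referenced by the transform rules above.
// The class name must match the `classpath` entry in the pipeline definition;
// this minimal form relies only on a public `eval` method and needs no imports.
public class AddOneFunctionClass {

    // Invoked once per record for expressions such as `addone(id) > 10`.
    public Integer eval(Integer id) {
        return id + 1;
    }
}
```

The `format` function referenced in the projections would follow the same pattern, with an `eval` method taking the format string and arguments.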
## Pipeline Configurations

The following configuration options are supported at the Data Pipeline level. Note that while each parameter is individually optional, at least one of them must be specified; that is, the `pipeline` section is mandatory and cannot be empty.

| Parameter | Meaning | Optional/Required |
|-----------|---------|-------------------|
| `name` | The name of the pipeline, which will be submitted to the Flink cluster as the job name. | optional |
| `parallelism` | The global parallelism of the pipeline. Defaults to 1. | optional |
| `local-time-zone` | The local time zone, which defines the current session time zone id. | optional |
| `execution.runtime-mode` | The runtime mode of the pipeline, either `STREAMING` or `BATCH`. Defaults to `STREAMING`. | optional |
| `schema.change.behavior` | How to handle schema changes. One of: `exception`, `evolve`, `try_evolve`, `lenient` (default) or `ignore`. | optional |
| `schema.operator.uid` | The unique ID for the schema operator. This ID is used for inter-operator communications and must be unique across operators. Deprecated: use `operator.uid.prefix` instead. | optional |
| `schema-operator.rpc-timeout` | The timeout for the SchemaOperator to wait for downstream SchemaChangeEvents to finish applying. Defaults to 3 minutes. | optional |
| `operator.uid.prefix` | The prefix used for all pipeline operator UIDs. If not set, operator UIDs are generated by Flink. Setting this parameter is recommended to ensure stable and recognizable operator UIDs, which helps with stateful upgrades, troubleshooting, and Flink UI diagnostics. | optional |

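Putting several of the options from the table together, a `pipeline` section might look like the following sketch (the name and prefix values are illustrative, not required values):

```yaml
pipeline:
  name: Sync MySQL Database to Doris   # submitted to the Flink cluster as the job name
  parallelism: 4                       # global parallelism; defaults to 1
  execution.runtime-mode: STREAMING    # or BATCH
  schema.change.behavior: lenient      # exception | evolve | try_evolve | lenient | ignore
  operator.uid.prefix: mysql-to-doris  # stable operator UIDs for stateful upgrades
```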