Definition #
Since events in Flink CDC flow from upstream to downstream in a pipeline manner, the whole ETL task is referred to as a Data Pipeline.
Parameters #
A pipeline corresponds to a chain of operators in Flink.
To describe a Data Pipeline, the following parts are required:

- source
- sink
- pipeline

and the following parts are optional:

- route
- transform
Example #
Only required #
We could use the following YAML file to define a concise Data Pipeline that synchronizes all tables under the MySQL `app_db` database to Doris:
```yaml
pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2

source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""
```

With optional #
We could use the following YAML file to define a more complex Data Pipeline that synchronizes all tables under the MySQL `app_db` database to Doris, routing them to the target database `ods_db` with the target table name prefix `ods_`:
```yaml
source:
  type: mysql
  hostname: localhost
  port: 3306
  username: root
  password: 123456
  tables: app_db.\.*

sink:
  type: doris
  fenodes: 127.0.0.1:8030
  username: root
  password: ""

transform:
  - source-table: adb.web_order01
    projection: \*, format('%S', product_name) as product_name
    filter: addone(id) > 10 AND order_id > 100
    description: project fields and filter
  - source-table: adb.web_order02
    projection: \*, format('%S', product_name) as product_name
    filter: addone(id) > 20 AND order_id > 200
    description: project fields and filter

route:
  - source-table: app_db.orders
    sink-table: ods_db.ods_orders
  - source-table: app_db.shipments
    sink-table: ods_db.ods_shipments
  - source-table: app_db.products
    sink-table: ods_db.ods_products

pipeline:
  name: Sync MySQL Database to Doris
  parallelism: 2
  user-defined-function:
    - name: addone
      classpath: com.example.functions.AddOneFunctionClass
    - name: format
      classpath: com.example.functions.FormatFunctionClass
```

Pipeline Configurations #
The following config options are supported at the Data Pipeline level. Note that while each parameter is individually optional, at least one of them must be specified; that is, the pipeline section is mandatory and cannot be empty.
| Parameter | Meaning | Optional/Required |
|---|---|---|
| name | The name of the pipeline, which is submitted to the Flink cluster as the job name. | optional |
| parallelism | The global parallelism of the pipeline. Defaults to 1. | optional |
| local-time-zone | The local time zone, which defines the current session time zone id. | optional |
| execution.runtime-mode | The runtime mode of the pipeline, either STREAMING or BATCH. Defaults to STREAMING. | optional |
| schema.change.behavior | How to handle schema changes. One of: exception, evolve, try_evolve, lenient (default), or ignore. | optional |
| schema.operator.uid | The unique ID for the schema operator. This ID is used for inter-operator communication and must be unique across operators. Deprecated: use operator.uid.prefix instead. | optional |
| schema-operator.rpc-timeout | The timeout for the SchemaOperator to wait for downstream operators to finish applying a SchemaChangeEvent. Defaults to 3 minutes. | optional |
| operator.uid.prefix | The prefix used for all pipeline operator UIDs. If not set, all pipeline operator UIDs are generated by Flink. Setting this parameter is recommended to ensure stable and recognizable operator UIDs, which helps with stateful upgrades, troubleshooting, and Flink UI diagnostics. | optional |
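As an illustration of the options above, a `pipeline` block combining several of them might look like the sketch below. The option keys are taken from the table; the specific values (parallelism of 4, the `exception` behavior, the UID prefix string) are arbitrary examples, not recommendations:

```yaml
pipeline:
  # Job name shown in the Flink cluster.
  name: Sync MySQL Database to Doris
  # Global parallelism for all pipeline operators.
  parallelism: 4
  # Run as an unbounded streaming job (the default) rather than BATCH.
  execution.runtime-mode: STREAMING
  # Fail fast on upstream schema changes instead of the default "lenient".
  schema.change.behavior: exception
  # Stable operator UID prefix, useful for stateful upgrades.
  operator.uid.prefix: mysql-to-doris
```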