Batch synchronization tasks

DataWorks provides readers and writers for batch synchronization nodes to simplify the process of data synchronization between data sources. You can add data sources to a workspace in the visualized user interface (UI), and synchronize full or incremental data between these data sources by using the scheduling capability of DataWorks. This topic describes how to use a batch synchronization node to perform data synchronization. In this example, a MaxCompute data source is used as the source, and a Hologres data source is used as the destination.

Prerequisites

(Required if you use a RAM user to develop tasks) The desired RAM user is added to your DataWorks workspace as a member and is assigned the Development or Workspace Administrator role. The Workspace Administrator role has more permissions than necessary. Exercise caution when you assign the Workspace Administrator role. For more information about how to add a member, see Add workspace members and assign roles to them.
Note
If you use an Alibaba Cloud account, you can skip this operation.
A MaxCompute data source and a Hologres data source are added to the workspace, and the data sources have passed the network connectivity test. For more information, see Add and manage data sources.
Note
Batch synchronization nodes support multiple types of data sources. For more information, see Supported data source types and synchronization operations.

Limits

The batch synchronization feature provided by DataWorks does not support data synchronization across time zones. If the data sources of a batch synchronization node reside in a different time zone from the resource group that is used to run the task, errors may occur during data synchronization.
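Before you run a task, it can help to confirm that the data sources and the resource group share the same UTC offset. The following is a minimal sketch of such a check in Python, not a DataWorks feature. The function name `same_utc_offset` and the example offsets are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone, tzinfo


def same_utc_offset(source_tz: tzinfo, group_tz: tzinfo, at: datetime) -> bool:
    """Return True if both time zones have the same UTC offset at the instant `at`.

    `at` must be a timezone-aware datetime.
    """
    return at.astimezone(source_tz).utcoffset() == at.astimezone(group_tz).utcoffset()


# Hypothetical values: a data source at UTC+8, compared with two resource groups.
instant = datetime(2024, 1, 1, tzinfo=timezone.utc)
utc_plus_8 = timezone(timedelta(hours=8))

print(same_utc_offset(utc_plus_8, utc_plus_8, instant))    # True
print(same_utc_offset(utc_plus_8, timezone.utc, instant))  # False
```

The function accepts any `tzinfo` object, so named zones from the standard `zoneinfo` module can be passed in the same way.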

1. Create a batch synchronization node

Create a batch synchronization node. For more information, see Create an auto triggered node.

2. Configure network connectivity and a resource group

On the configuration tab of the batch synchronization node, select the source type and its data source (the Source and Data Source Name parameters), select the destination type and its data source (the Destination and Data Source Name parameters), select a resource group from the drop-down list in the middle, and then click Next. Make sure that the resource group can connect to both data sources.

3. Configure the batch synchronization node

On the configuration tab of the batch synchronization node, you can use different methods to configure the node.

In most cases, we recommend that you configure a batch synchronization node by using the codeless UI, which is intuitive and convenient. If the data sources of the batch synchronization node do not support the codeless UI, you can click Code Editor in the top toolbar of the configuration tab and configure the batch synchronization node by using the code editor.

Important

If you switch from the codeless UI to the code editor when you configure a batch synchronization node, you cannot switch back to the codeless UI. To use the codeless UI again, you must create another batch synchronization node.

Codeless UI

Configure the source and destination

In the Configure Source and Destination section, configure the following parameters based on your business requirements.

Parameter	Source	Destination
Tunnel Resource Group	The default value is `Common transmission resources`, which indicates a default resource group provided by the system. You can also click New purchase to purchase more resources.	N/A.
Schema	The default value is `default`. You can select an existing schema.	The default value is `public`. You can select an existing schema.
Table	The name of the table from which you want to read data.	The name of the table to which you want to write data.
Filtering Method	Valid values: Partition Filtering: You can specify the partition in which data is stored to filter data. In most cases, this method is used to filter data that is divided by time or another dimension. Data Filter: You can specify a specific condition, such as a WHERE clause, to extract data that meets the condition.	N/A.
Partition Filtering	If you set the Filtering Method parameter to Partition Filtering, you can specify the partition in which data is stored to filter data.	N/A.
Data filtering	If you set the Filtering Method parameter to Data Filter, you can specify a WHERE clause to filter data.	N/A.
Partition information	If you set the Filtering Method parameter to Partition Filtering, you can configure the related parameters. The system automatically scans and displays the partition information in the source table. If you want to configure filter conditions for multiple partitions, you can click Add Partition to add partitions and configure a filter condition for each partition. Important During debugging, you must specify variable values in the debugging configurations. At runtime, the system dynamically replaces the variables with actual values.	If the destination table is a partitioned table, the system automatically scans the partition information of the table. You can configure a condition for writing data to partitions. If the destination table is a non-partitioned table, no partition information is displayed by default.
If partitions do not exist	If you set the Filtering Method parameter to Partition Filtering, you can configure this parameter to specify the processing policy that is used when a specified partition does not exist. You can set this parameter to Error, or you can choose to ignore the missing partitions so that the task runs normally.	N/A.
Write conflict strategy	N/A.	Valid values: Replace: If the destination table contains data records specified by the same primary key or unique key as the mapped source table, these data records in the destination table are replaced. Otherwise, new data records are inserted into the destination table. Ignore: If the destination table contains data records specified by the same primary key or unique key as the mapped source table, these data records in the destination table are ignored and no operations are performed. Update: If the destination table contains data records specified by the same primary key or unique key as the mapped source table, these data records in the destination table are updated.
Whether to clear the Hologres table before synchronization	N/A.	You can select Clear the target table (true) or Do not empty the target table (false).
Maximum Connection Count	N/A.	The maximum number of Java Database Connectivity (JDBC) connections that can be used to write data to the destination table. This parameter takes effect only if data is written by executing SQL statements, in which case the INSERT INTO write mode is used. Before you start the batch synchronization node, make sure that the related Hologres instance has sufficient idle connections. Note The maximum value of this parameter is 9.
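The three values of the Write conflict strategy parameter can be illustrated with a small in-memory model. The following Python sketch is not how DataWorks or Hologres implements the feature. The function `apply_rows` and the sample rows are hypothetical. The sketch also assumes two things that the table above does not state: records without a conflict are inserted in every mode, and Replace overwrites the whole row (unmapped columns become empty) while Update changes only the mapped columns.

```python
from typing import Dict, Iterable, List

Row = Dict[str, object]


def apply_rows(dest: Dict[object, Row], rows: Iterable[Row], key: str,
               mode: str, all_columns: List[str]) -> None:
    """Apply source rows to an in-memory 'table' that is keyed by primary key."""
    if mode not in ("replace", "ignore", "update"):
        raise ValueError(f"unknown conflict mode: {mode}")
    for row in rows:
        pk = row[key]
        if pk not in dest:
            # No conflict: insert the record (assumption: same in every mode).
            dest[pk] = {c: row.get(c) for c in all_columns}
        elif mode == "ignore":
            # Conflict: keep the existing record untouched.
            continue
        elif mode == "replace":
            # Conflict: overwrite the whole record; unmapped columns become None.
            dest[pk] = {c: row.get(c) for c in all_columns}
        else:
            # mode == "update": overwrite only the columns present in the source row.
            dest[pk].update(row)


columns = ["id", "name", "note"]
source_rows = [{"id": "1", "name": "new"}, {"id": "2", "name": "added"}]
for mode in ("replace", "ignore", "update"):
    table = {"1": {"id": "1", "name": "old", "note": "keep"}}
    apply_rows(table, source_rows, key="id", mode=mode, all_columns=columns)
    print(mode, table["1"])
# replace {'id': '1', 'name': 'new', 'note': None}
# ignore {'id': '1', 'name': 'old', 'note': 'keep'}
# update {'id': '1', 'name': 'new', 'note': 'keep'}
```

In all three runs the record with `id` 2 is inserted, because it does not conflict with an existing record.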

Configure field mappings

In the Field Mapping section, click Add a row in the table that displays source fields to add a field, and then map the field to a destination field. To delete or modify an existing mapping, click Delete or Revised next to the mapping.

Configure channel control settings

In the Channel Control section, configure the parameters related to channel control.

Note

You can enable the distributed execution mode for the batch synchronization node only if you set the Task Expected Maximum Concurrency parameter to a value greater than or equal to 8.

Code editor

The following sample code shows how to configure the batch synchronization node in the code editor:

Note

For details about the code editor (script mode) configuration of each data source type, see the list of data sources.

{
  "transform": false,
  "type": "job",
  "version": "2.0",
  "steps": [
    {
      "stepType": "odps", // The source type.
      "parameter": {
        "schema": "default",
        "partition": [
          "year=${bizdate},month="
        ],
        "datasource": "The name of the source",
        "envType": 1,
        "successOnNoPartition": false,
        "tunnelQuota": "default",
        "isSupportThreeModel": true,
        "column": [
          "Field 1",
          "Field 2",
          "Field 3",
          "Field 4",
          "Field..."
        ],
        "tableComment": "",
        "enableWhere": false,
        "table": "The name of the source table."
      },
      "name": "Reader",
      "category": "reader"
    },
    {
      "stepType": "holo", // The destination type.
      "parameter": {
        "selectedDatabase": "public",
        "maxConnectionCount": 9,
        "partition": "order_month=${bizdate}",
        "truncate": "false",
        "datasource": "The name of the destination",
        "conflictMode": "ignore",
        "envType": 1,
        "column": [
          "Field 1",
          "Field 2",
          "Field 3",
          "Field 4",
          "Field..."
        ],
        "tableComment": "",
        "table": "The name of the destination table"
      },
      "name": "Writer",
      "category": "writer"
    },
    {
      "name": "Processor",
      "stepType": null,
      "category": "processor",
      "copies": 1,
      "parameter": {
        "nodes": [],
        "edges": [],
        "groups": [],
        "version": "2.0"
      }
    }
  ],
  "setting": {
    "executeMode": null,
    "failoverEnable": null,
    "errorLimit": {
      "record": "0"
    },
    "speed": {
      "concurrent": 2,
      "throttle": false
    }
  },
  "order": {
    "hops": [
      {
        "from": "Reader",
        "to": "Writer"
      }
    ]
  }
}
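A script like the sample above can be sanity-checked before you paste it into the code editor. The following Python sketch is a hypothetical helper, not part of DataWorks. The function `check_job` is an assumed name, and the rule that the reader and the writer list the same number of columns is an assumption based on one-to-one field mappings. The limit of 9 connections comes from the Maximum Connection Count parameter. The job is written as a Python dict because the `//` comments in the sample are not valid JSON.

```python
from typing import Any, Dict, List

MAX_HOLO_CONNECTIONS = 9  # Upper bound stated for Maximum Connection Count.


def check_job(job: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a job definition (empty if none)."""
    problems: List[str] = []
    steps = {step["name"]: step for step in job.get("steps", [])}
    readers = [s for s in steps.values() if s.get("category") == "reader"]
    writers = [s for s in steps.values() if s.get("category") == "writer"]
    if len(readers) != 1 or len(writers) != 1:
        return ["expected exactly one reader step and one writer step"]

    reader_cols = readers[0]["parameter"].get("column", [])
    writer_cols = writers[0]["parameter"].get("column", [])
    if len(reader_cols) != len(writer_cols):
        # Assumption: each source column maps to exactly one destination column.
        problems.append("reader and writer list a different number of columns")

    max_conn = writers[0]["parameter"].get("maxConnectionCount")
    if max_conn is not None and max_conn > MAX_HOLO_CONNECTIONS:
        problems.append(f"maxConnectionCount exceeds {MAX_HOLO_CONNECTIONS}")

    for hop in job.get("order", {}).get("hops", []):
        for end in ("from", "to"):
            if hop.get(end) not in steps:
                problems.append(f"hop refers to unknown step: {hop.get(end)}")
    return problems


# A trimmed-down version of the sample job with placeholder column names.
job = {
    "steps": [
        {"name": "Reader", "category": "reader", "stepType": "odps",
         "parameter": {"column": ["c1", "c2"]}},
        {"name": "Writer", "category": "writer", "stepType": "holo",
         "parameter": {"column": ["c1", "c2"], "maxConnectionCount": 9}},
    ],
    "order": {"hops": [{"from": "Reader", "to": "Writer"}]},
}
print(check_job(job))  # []
```

The check returns messages instead of raising exceptions so that all problems in a script are reported in one pass.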

4. Configure debugging parameters

In the right-side navigation pane of the configuration tab of the batch synchronization node, click Debugging Configurations. On the Debugging Configurations tab, configure the following parameters. These parameters are used to debug and run the batch synchronization node.

Parameter	Description
Resource Group	Select the serverless resource group that you specified in step 2, Configure network connectivity and a resource group.
Script Parameters	If you configure scheduling parameters for the batch synchronization node, assign values to the scheduling parameters when you configure debugging parameters for the node. This ensures that the batch synchronization node can obtain the scheduling parameters when you debug and run the node. Note If you synchronize a partitioned table with partition filtering enabled and the partition parameter is set to ${bizdate}, set bizdate to a partition value that exists in the source table.
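The variable replacement described for Script Parameters can be previewed locally. The following Python sketch only illustrates the idea of substituting a value for `${bizdate}` in a partition expression. DataWorks performs the actual replacement itself, and the helper name `render_partition` and the value `20240101` are hypothetical.

```python
from string import Template
from typing import Mapping


def render_partition(expression: str, params: Mapping[str, str]) -> str:
    """Replace ${name} placeholders in a partition expression.

    Raises KeyError if a placeholder has no value in `params`, which mirrors
    the need to assign every scheduling parameter a value before debugging.
    """
    return Template(expression).substitute(params)


# Hypothetical debugging value for the bizdate variable.
print(render_partition("order_month=${bizdate}", {"bizdate": "20240101"}))
# order_month=20240101
```

Calling `render_partition("order_month=${bizdate}", {})` raises `KeyError`, which makes a missing debugging value visible immediately.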

What to do next

If you want the system to periodically schedule the batch synchronization node, configure scheduling properties for the node based on your business requirements. For more information, see Node scheduling.
After the configuration of the batch synchronization node is complete, deploy the node. For more information, see Node or workflow deployment.
After the batch synchronization node is deployed, view the running information of the node in Operation Center. For information about Operation Center, see Getting started with Operation Center.