Features of batch synchronization - DataWorks - Alibaba Cloud Documentation Center

The batch synchronization feature in Data Integration provides Reader and Writer plugins. These plugins allow you to synchronize full or incremental data from a source database to a target database. You can perform this synchronization by defining the source and destination data sources and using DataWorks scheduling parameters. This topic describes the features of batch synchronization.

Core features

The following figure shows the features of batch synchronization.

[Figure: Batch synchronization capabilities]

Capabilities	Description
Data synchronization between disparate data sources	Data Integration supports data synchronization for over 50 types of data sources, including relational databases, unstructured storage, big data storage, and message queues. You can use the Reader and Writer plugins to transfer data between any structured or semi-structured data sources by defining the source and destination data sources. For more information, see Supported data sources and synchronization solutions.
Data synchronization in complex network environments	Batch synchronization supports data synchronization in various environments. These environments include ApsaraDB databases, on-premises data centers, self-managed databases on ECS instances, and databases not hosted on Alibaba Cloud. You must ensure that there is network connectivity between the resource group and the source or destination. For more information about configuration, see Network connectivity solutions.
Synchronization scenarios	1. Supported synchronization modes
	Periodic full synchronization: Periodically overwrites the target table with all data from the source table. This is suitable for full update scenarios.
	Periodic incremental synchronization: Synchronizes only new or changed data from the source table on a daily or hourly basis. This is achieved using built-in scheduling parameters, such as `${bizdate}`, with the `WHERE` clause for data filtering. This process ensures that only specified data is retrieved and written to the corresponding time partition during each run. For more information, see Scenario: Configure a batch synchronization task for incremental data.
	Historical data backfill: You can use the Backfill Data feature in the Operation Center to backfill a large amount of historical data at once. This feature enables efficient archiving of historical data by running synchronization tasks in batches.
	Note: For more information about scheduling parameters, see Common scenarios for scheduling parameters in Data Integration and Supported formats of scheduling parameters.
	2. Supported source structures
	Single table to single table: The most basic synchronization method. It synchronizes data from one source table to one target table.
	Sharded tables to a single table: Automatically aggregates data from multiple physical tables, such as `order_01` and `order_02`, and writes the data to a single target table. Supported data sources include MySQL, SQL Server, Oracle, PostgreSQL, PolarDB, and AnalyticDB. For more information, see Configure a batch synchronization task for sharded tables.
Configuration methods	You can configure Data Integration batch synchronization tasks in the following ways.
	Codeless UI: Provides a guided visual interface to help you complete the configuration step by step. This method is easy to learn and ideal for getting started quickly. However, some advanced features are not available in this mode.
	Code editor: Allows you to use a JSON script to directly define the synchronization logic. This method is suitable for advanced users and supports more complex configuration scenarios and fine-grained control.
	Create using OpenAPI: Allows you to programmatically manage the full lifecycle of tasks using OpenAPI.
	Note: For more information about task configuration features, see Function Overview.
Batch synchronization task O&M	Monitoring and alerts: Allows you to monitor the running status of batch synchronization tasks and to configure alerts for tasks that are incomplete, have failed, or have completed. Alerts can be sent to recipients through various methods, such as email, text messages, phone calls, DingTalk group chatbots, and webhooks.
	Data Quality: After a task is submitted and published, you can configure Data Quality monitoring rules for the target table in Operation Center. Currently, Data Quality monitoring rules are supported only for some database types.
	Data source environment isolation: You can bind a single data source name to two independent configurations, one for development and one for production. The data source is then switched automatically based on the environment in which the task runs. This keeps development and testing separate from production scheduling and prevents test operations from accidentally affecting production data.
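
The periodic incremental synchronization mode described in the table above relies on a scheduling parameter being substituted into a filter condition before each run. The following Python sketch illustrates that substitution only. It is not the Data Integration implementation. It assumes that `${bizdate}` resolves to the day before the scheduled run date in `yyyymmdd` format, and the column name `gmt_modified`, the partition key `ds`, and the helper function names are hypothetical.

```python
from datetime import date, timedelta


def resolve_bizdate(run_date: date) -> str:
    """Return the business date as yyyymmdd.

    Assumption: the business date is the day before the scheduled run date.
    """
    return (run_date - timedelta(days=1)).strftime("%Y%m%d")


def render(template: str, run_date: date) -> str:
    """Replace the ${bizdate} placeholder in a template with its value."""
    return template.replace("${bizdate}", resolve_bizdate(run_date))


# Hypothetical filter for the source table and partition for the target table.
WHERE_TEMPLATE = "DATE_FORMAT(gmt_modified, '%Y%m%d') = '${bizdate}'"
PARTITION_TEMPLATE = "ds=${bizdate}"

run_date = date(2024, 3, 15)
where_clause = render(WHERE_TEMPLATE, run_date)
partition = render(PARTITION_TEMPLATE, run_date)

print(where_clause)  # DATE_FORMAT(gmt_modified, '%Y%m%d') = '20240314'
print(partition)     # ds=20240314
```

Because both the filter and the target partition are derived from the same value, each scheduled run reads one day of source data and writes it to the matching partition. A data backfill can be thought of as repeating this rendering for a range of past run dates.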

Function Overview

[Figure: Task configuration]

Feature	Description
Full or incremental data synchronization	Batch synchronization tasks can perform full or incremental data synchronization. To do this, configure Data Filtering and use scheduling parameters. The configuration method for incremental synchronization varies by plugin. For more information about configuring incremental data synchronization, see Scenario: Configure a batch synchronization task for incremental data.
Field mapping	Establish field mapping rules to write source data to the corresponding fields in the target based on the specified relationships. When you configure the mapping, ensure that the field types at both ends are compatible. Multiple field mapping methods are available:
	The codeless UI supports mapping by name, mapping by position, and custom field relationships. Data in unmapped fields is automatically ignored. To avoid write failures, ensure that the corresponding fields in the target have default values or allow null values.
	The code editor performs strict mapping based on the order of columns in the configuration. The number of fields in the reader and writer configurations must be identical. Otherwise, the task will fail to run.
	Synchronization tasks also provide dynamic assignment for target fields. You can flexibly configure constants, scheduling parameters, and built-in variables, such as `${bizdate}`. These parameters are assigned their final values during the scheduling phase.
Job rate limiting	The task concurrency control feature limits the maximum concurrency for reading from and writing to databases in Data Integration. The synchronization speed feature controls traffic to prevent excessive speed from overwhelming the source or destination data source. If no rate limit is set, the task uses the maximum transfer performance available in the current hardware environment.
Distributed task execution	For data sources that support distributed execution, task segmentation technology can be used to distribute a synchronization task for concurrent execution across multiple nodes. This allows synchronization speed to scale linearly with the cluster size, breaking through single-node performance bottlenecks. This mode is especially suitable for high-throughput, low-latency synchronization scenarios. It also efficiently schedules idle cluster resources, significantly improving hardware utilization.
Dirty data policy	Dirty data refers to data records that fail to be written to the target due to errors, such as type mismatches or constraint violations. Batch synchronization lets you define a dirty data policy. You can specify the number of tolerable dirty data records and their effect on the task.
	Ignore dirty data: Automatically filters out dirty data and writes only valid data. The task continues to run.
	Tolerate a limited number of dirty data records: Set a threshold N. If the number of dirty data records is less than or equal to N, the abnormal records are discarded and the task continues. If the number exceeds N, the task fails and exits.
	Do not tolerate dirty data: The task fails and exits immediately if any dirty data is encountered.
Time zone	To synchronize data across different time zones, you can set the source time zone to perform a time zone conversion.
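
The three dirty data policies described in the table above differ only in the threshold that is applied to the count of failed records. The following Python sketch models that decision logic under stated assumptions. It is not the Data Integration implementation. The names `sync_records`, `write_amount`, and `DirtyDataLimitExceeded` are hypothetical, and a `dirty_limit` of `None` stands for "ignore dirty data", `0` for "do not tolerate dirty data", and N for "tolerate up to N records".

```python
from typing import Callable, Iterable, Optional, Tuple


class DirtyDataLimitExceeded(RuntimeError):
    """Raised when the number of dirty records exceeds the configured limit."""


def sync_records(
    records: Iterable[dict],
    write: Callable[[dict], None],
    dirty_limit: Optional[int],
) -> Tuple[int, int]:
    """Write records one by one and return (written, dirty) counts.

    A record counts as dirty when write() raises TypeError or ValueError.
    """
    written = 0
    dirty = 0
    for record in records:
        try:
            write(record)
            written += 1
        except (TypeError, ValueError):
            dirty += 1
            # Fail as soon as the count exceeds the limit; None means no limit.
            if dirty_limit is not None and dirty > dirty_limit:
                raise DirtyDataLimitExceeded(
                    f"{dirty} dirty records exceed the limit of {dirty_limit}"
                )
    return written, dirty


def write_amount(record: dict) -> None:
    """Hypothetical writer: the target column accepts integers only."""
    int(record["amount"])  # raises ValueError on a type mismatch


rows = [{"amount": "10"}, {"amount": "abc"}, {"amount": "30"}]

print(sync_records(rows, write_amount, dirty_limit=None))  # (2, 1)
print(sync_records(rows, write_amount, dirty_limit=1))     # (2, 1)
```

With `dirty_limit=0`, the same input raises `DirtyDataLimitExceeded` at the second record, which corresponds to the policy in which the task fails and exits as soon as any dirty data is encountered.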

What to do next

For more information about how to create a task, see the following topics:

Configure a batch synchronization task using the codeless UI

Configure a batch synchronization task using the code editor

Configure a batch synchronization task for sharded tables