Real-time synchronization from MySQL database to Elasticsearch - DataWorks

Data Integration supports real-time synchronization of entire databases from sources such as MySQL and PolarDB to Elasticsearch. This topic describes how to synchronize both full and incremental data from an entire MySQL database to Elasticsearch.

Prerequisites

You have purchased a Serverless resource group or an exclusive resource group for Data Integration.
You have created MySQL and Elasticsearch data sources. For more information, see Create a data source for Data Integration.
You have established network connectivity between the resource group and data sources. For more information, see Network connectivity solutions.

Procedure

1. Select the synchronization task type

Go to the Data Integration page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose Data Integration > Data Integration. On the page that appears, select the desired workspace from the drop-down list and click Go to Data Integration.
In the left-side navigation pane, click Synchronization Task, and then click Create Synchronization Task at the top of the page. On the page that appears, configure the following basic information:
- Data Source And Destination: MySQL→Elasticsearch
- New Task Name: Enter a custom name for the synchronization task.
- Synchronization Type: Real-time Database Synchronization.
- Synchronization Steps: Select both Full Synchronization and Incremental Synchronization.

2. Configure network and resources

In the Network And Resource Configuration section, select the Resource Group for the synchronization task. You can allocate the number of compute units (CUs) for Task Resource Usage.
For Source Data Source, select the added MySQL data source. For Destination Data Source, select the added Elasticsearch data source. Then, click Test Connectivity.
After you confirm that both the source and destination data sources are connected successfully, click Next.

3. Select the tables from which you want to synchronize data

In the Source Table list, select the tables that you want to synchronize, and then click the icon to move them to the Selected Tables list.

4. Configure destination index mapping

After you select the tables to synchronize in the previous step, the tables are automatically displayed on this page. However, the properties of the destination indexes are in the pending refresh mapping status by default. You need to define and confirm the mapping relationship between the source tables and destination indexes, which determines how data is read and written. Then, click Refresh Mapping to proceed to the next step. You can either refresh the mapping directly or customize the destination index rules before refreshing the mapping.

Note

You can select the tables to be synchronized and click Batch Refresh Mapping. If no mapping rule is configured, the default index name rule is ${source table name}. If an index with the same name does not exist in the destination, a new index will be automatically created.
In the Custom Destination Index Name Mapping column, you can click the Edit button to customize the destination index name rule.
You can use built-in variables and manually entered strings to form the final destination index name. You can edit the built-in variables. For example, you can create a table name rule to add a suffix to the source table name as the destination index name.
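
The naming rule described in this note can be sketched as a simple template expansion. This is only an illustration: the `${source table name}` variable is the default rule quoted above, while the `render_index_name` function and the `_es` suffix are hypothetical names chosen for the example, not part of Data Integration.

```python
# Built-in variable as quoted in the note above.
SOURCE_TABLE_VAR = "${source table name}"


def render_index_name(rule: str, source_table: str) -> str:
    """Expand a destination index name rule for one source table.

    The rule is a mix of the built-in variable and manually entered
    strings. This helper is hypothetical and only mimics the idea.
    """
    return rule.replace(SOURCE_TABLE_VAR, source_table)


if __name__ == "__main__":
    # Hypothetical rule: append the suffix "_es" to every source table name.
    rule = "${source table name}_es"
    for table in ("orders", "customers"):
        print(table, "->", render_index_name(rule, table))
```

With the default rule, the index name equals the source table name. With the suffix rule, each table maps to its own suffixed index.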

a. Modify data type mappings for fields

Default mappings exist between data types of source fields and data types of destination fields. You can click Edit Mapping of Field Data Types in the upper-right corner of the Mapping Rules for Destination Tables section to configure data type mappings between source fields and destination fields based on your business requirements. After the configuration is complete, click Apply and Refresh Mapping.

b. Edit the destination index structure and add field assignments

When the destination index is in the To Be Created status, you can add fields to the destination index based on the original table schema. To do this, perform the following operations:

Add fields to the destination index
- Add fields to a single table: Click the button in the Destination Index Name column and add fields by configuring the Create Index Statement.
  - Dynamic Mapping Status: specifies whether to dynamically synchronize new fields in the source tables to the destination indexes during synchronization. Valid values:
    - true: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination indexes. Then, the fields can be searched in the indexes. This is the default value.
    - false: If the system detects that the source tables contain new fields, the system synchronizes the fields to the mapped destination indexes. However, the fields cannot be searched in the indexes after synchronization.
    - strict: If the system detects that the source tables contain new fields, the system does not synchronize the fields to the mapped destination indexes and reports an error. You can view the details of the error in the log information.
    For more information about dynamic mappings, see dynamic mapping.
  - Shards and Replicas: the number of primary shards for each destination index and the number of replica shards for each primary shard. The shards are distributed on different Elasticsearch nodes. This way, distributed searches can be performed and the query efficiency of Elasticsearch is improved. For more information, see Terms.
    Note
    The values of the Shards and Replicas parameters cannot be changed after you configure the parameters and run the solution. The default values of these parameters are 1.
- Add fields in batch: Select all tables to be synchronized, and at the bottom of the table, select Batch Modify > Destination Index Structure - Batch Add Fields.
Assign values to the fields. You can perform one of the following operations to assign values to the fields added in the previous step:
- Assign values to a single table: Click the Configure button in the Destination Index Field Assignment column to assign values to the fields of the destination index.
- Assign values in batch: At the bottom of the list, select Batch Modify > Destination Index Field Assignment to assign values to the same fields in multiple destination indexes in batch.
Note
When assigning values, you can assign constants and variables. You can switch the assignment mode by clicking the icon.
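
The Dynamic Mapping Status, Shards, and Replicas settings described above correspond to fields in a standard Elasticsearch create-index request body. The following sketch builds such a body, assuming the Create Index Statement follows that standard shape. The `build_index_body` helper and the `order_id` and `note` fields are hypothetical, and the statement shown in the DataWorks dialog may differ.

```python
import json

# The three values listed for Dynamic Mapping Status.
VALID_DYNAMIC = {"true", "false", "strict"}


def build_index_body(dynamic="true", shards=1, replicas=1, properties=None):
    """Build a create-index body in the standard Elasticsearch layout.

    Hypothetical helper for illustration. The defaults mirror the
    defaults stated in this topic (dynamic=true, shards=1, replicas=1).
    """
    if dynamic not in VALID_DYNAMIC:
        raise ValueError(f"dynamic must be one of {sorted(VALID_DYNAMIC)}")
    if shards < 1 or replicas < 0:
        raise ValueError("shards must be >= 1 and replicas must be >= 0")
    return {
        "settings": {
            "number_of_shards": shards,
            "number_of_replicas": replicas,
        },
        "mappings": {
            "dynamic": dynamic,
            # Explicitly declared fields; names and types are made up here.
            "properties": dict(properties or {}),
        },
    }


if __name__ == "__main__":
    body = build_index_body(
        dynamic="strict",
        properties={"order_id": {"type": "long"}, "note": {"type": "text"}},
    )
    print(json.dumps(body, indent=2))
```

Because the number of shards cannot be changed after the solution runs, it is worth deciding on these values before the destination index is created.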

c. Configure DML processing rules

Data Integration provides default DML processing rules. You can also configure DML processing rules for destination tables based on your business requirements.

Configure DML processing rules for a single destination table: Find the destination table for which you want to configure DML processing rules and click Configure in the Configure DML Rule column to configure DML processing rules for the table.
Configure DML processing rules for multiple destination tables at a time: Select the destination tables for which you want to configure DML processing rules, click Batch Modify in the lower part of the page, and then click Configure DML Rule.
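
This topic does not list the available DML rules, so the following is only a conceptual sketch of how INSERT, UPDATE, and DELETE events from a source table could be expressed as actions in the standard Elasticsearch bulk API format. The `to_bulk_lines` helper, the operation names, and the sample row are assumptions made for illustration and do not describe the default rules of Data Integration.

```python
import json


def to_bulk_lines(op, index, doc_id, row=None):
    """Translate one DML event into Elasticsearch bulk API lines.

    Hypothetical mapping: INSERT -> index, UPDATE -> update, DELETE -> delete.
    """
    meta = {"_index": index, "_id": doc_id}
    if op == "INSERT":
        # Action line followed by the full document.
        return [json.dumps({"index": meta}), json.dumps(row)]
    if op == "UPDATE":
        # Action line followed by a partial document under "doc".
        return [json.dumps({"update": meta}), json.dumps({"doc": row})]
    if op == "DELETE":
        # A delete action has no document line.
        return [json.dumps({"delete": meta})]
    raise ValueError(f"unsupported DML operation: {op}")


if __name__ == "__main__":
    events = [
        ("INSERT", "orders", "1", {"status": "new"}),
        ("UPDATE", "orders", "1", {"status": "paid"}),
        ("DELETE", "orders", "1", None),
    ]
    for op, index, doc_id, row in events:
        print("\n".join(to_bulk_lines(op, index, doc_id, row)))
```

A rule that ignores a DML type would simply emit no lines for that event in this sketch.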

d. Customize advanced parameters

If you need to make fine-grained configurations for the task to meet custom synchronization requirements, you can click Configure in the Custom Advanced Parameters column to modify the advanced parameters.

Important

Before you modify the configurations of advanced parameters, make sure that you understand the meanings of the parameters to prevent unexpected errors or data quality issues.

5. Configure alert rules

To prevent the failure of the synchronization task from causing latency on business data synchronization, you can configure different alert rules for the synchronization task.

In the upper-right corner of the page, click Configure Alert Rule to go to the Configure Alert Rule panel.
In the Configure Alert Rule panel, click Add Alert Rule. In the Add Alert Rule dialog box, set the parameters of the alert rule.
Note
The alert rules that you configure in this step take effect for the real-time synchronization subtask that will be generated by the synchronization task. After the configuration of the synchronization task is complete, you can refer to Manage real-time synchronization tasks to go to the Real-time Synchronization Task page and modify alert rules configured for the real-time synchronization subtask.
Manage alert rules.
You can enable or disable alert rules that are created. You can also specify different alert recipients based on the severity levels of alerts.

6. Configure advanced parameters

You can change the values of specific parameters configured for the synchronization task based on your business requirements. For example, you can specify an appropriate value for the Maximum read connections parameter so that the synchronization task does not put excessive pressure on the source database and affect data production.

Note

To prevent unexpected errors or data quality issues, we recommend that you understand the meanings of the parameters before you change the values of the parameters.

In the upper-right corner of the configuration page, click Configure Advanced Parameters.
In the Configure Advanced Parameters panel, change the values of the desired parameters.

7. Configure DDL processing rules

DDL operations may be performed on the source. You can click Configure DDL Capability in the upper-right corner of the page to configure rules to process DDL messages from the source based on your business requirements.

Note

For more information, see Configure rules to process DDL messages.

8. View and change resource groups

You can click Configure Resource Group in the upper-right corner of the page to view and change the resource groups that are used to run the current synchronization task.

9. Run the synchronization task

After the configuration of the synchronization task is complete, click Complete in the lower part of the page.
In the Nodes section of the Data Integration page, find the created synchronization task and click Start in the Actions column.
Click the name or ID of the synchronization task in the Tasks section and view the detailed running process of the synchronization task.

Perform O&M operations on the synchronization task

View the status of the synchronization task

After the synchronization task is created, you can go to the Synchronization Task page to view all synchronization tasks that are created in the workspace and the basic information of each synchronization task.

You can Start or Stop the synchronization task in the Operation column. In the More menu, you can Edit, View, and perform other operations on the synchronization task.
For tasks that have been started, you can see the basic status of the task in Execution Overview, or click the corresponding overview area to view execution details.
The real-time synchronization task from MySQL to Elasticsearch consists of three steps:
- Schema Migration: This tab displays information such as whether the destination table is a newly created table or an existing table. For a newly created table, the DDL statement that is used to create the table is displayed.
- Full Data Initialization: This tab displays information such as the source tables and destination tables involved in batch synchronization, the synchronization progress, and the number of data records that are synchronized.
- Real-time Synchronization: This tab displays statistical information about real-time synchronization, including the synchronization progress, DDL records, DML records, and alert information.

Rerun the synchronization task

In some special cases, if you add tables to or remove tables from the source, or change the schema or name of a destination table, you can click More in the Actions column of the synchronization task and then click Rerun to rerun the task after the change. During the rerun process, the synchronization task synchronizes data only from the newly added tables to the destination or only from the mapped source table to the destination table whose schema or name is changed.

If you want to rerun the synchronization task without modifying the configuration of the task, click More in the Actions column and then click Rerun to rerun the task to perform full synchronization and incremental synchronization again.
If you want to rerun the synchronization task after you add tables to or remove tables from the task, click Complete after the change. In this case, Apply Updates is displayed in the Actions column of the synchronization task. Click Apply Updates to trigger the system to rerun the synchronization task. During the rerun process, the synchronization task synchronizes data from the newly added tables to the destination. Data in the original tables is not synchronized again.