Create an E-MapReduce (EMR) Spark SQL node to process structured data using a distributed SQL query engine. This improves job execution efficiency.
Prerequisites
If you need to customize the component environment before you develop nodes, you can create a custom image based on the official image `dataworks_emr_base_task_pod` and use the custom image in DataStudio. For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when you create the custom image.
An EMR cluster is registered with DataWorks. For more information, see Legacy Data Development: Attach EMR computing resources.
If you use a Resource Access Management (RAM) user to develop tasks, the RAM user must be added to the workspace and granted the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions and must be granted with caution. For more information about how to add members, see Add members to a workspace.
A resource group is purchased and configured. This includes attaching the resource group to a workspace and configuring the network. For more information, see Add and use a Serverless resource group.
A business flow is created. In DataStudio, development operations for different engines are organized within business flows. You must create a business flow before you can create a node. For more information, see Create a business flow.
If you require a specific development environment, you can use the custom image feature in DataWorks to build a component image for task execution. For more information, see Custom images.
Limits
This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group. If you need to use an image in DataStudio, use a serverless computing resource group.
To manage the metadata of DataLake or custom clusters in DataWorks, EMR-HOOK must be configured on the cluster. If EMR-HOOK is not configured, you cannot view metadata in real time, generate audit logs, view data lineage, or perform EMR-related administration tasks in DataWorks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Spark SQL.
Data lineage is not available for EMR on ACK Spark clusters. Data lineage is available for EMR Serverless Spark clusters.
You can use the visualization feature to register functions for DataLake clusters and custom clusters, but not for EMR on ACK Spark clusters or EMR Serverless Spark clusters.
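For cluster types where visualized registration is not available, functions can generally still be registered directly in the SQL code by using standard Spark SQL syntax. The following is a minimal sketch only; the function name, implementation class, and JAR path are hypothetical placeholders and are not taken from this topic.

```sql
-- Hypothetical example of registering a UDF with standard Spark SQL syntax.
-- Replace the function name, class name, and JAR location with your own values.
CREATE FUNCTION my_upper
AS 'com.example.udf.MyUpper'
USING JAR 'oss://my-bucket/udf/my-udf.jar';

-- Use the registered function in a query.
SELECT my_upper('hello');
```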
Precautions
If you have enabled Ranger permission control for the Spark component in the EMR cluster that is associated with your workspace, take note of the following items:
When you use the default image to run Spark tasks, Ranger permission control takes effect by default.
If you want to use a custom image to run Spark tasks, you must submit a ticket to contact technical support to upgrade the image. This way, Ranger permission control can take effect when you run Spark tasks.
1. Create an EMR Spark SQL node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark SQL node.
Right-click the target business flow and choose the option to create an EMR Spark SQL node.
Note: Alternatively, you can hover over New and choose the same option.
In the New Node dialog box, set Name, Engine Instance, Node Type, and Path. Click Confirm. The configuration tab of the EMR Spark SQL node appears.
Note: The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
2. Develop an EMR Spark SQL task
Double-click the EMR Spark SQL node that you created to open the task development tab.
Develop SQL code
In the SQL editing area, develop the task code. You can define variables in the code using the ${variable_name} format. On the node configuration tab, you can assign values to the variables in the Scheduling Configuration > Scheduling Parameters section in the right-side navigation pane. This lets you dynamically pass parameters in scheduling scenarios. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. The following is an example.
```sql
SHOW TABLES;

-- Define a variable named var in the ${var} format. If you assign the value ${yyyymmdd} to this
-- variable, you can create a table whose name is suffixed with the data timestamp. This can be
-- used together with scheduling parameters.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip  STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
);
```
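For reference, the following sketch shows how the statements above might look after the variable is resolved on a hypothetical data timestamp of 20240101. The source table access_log_source is a placeholder used only for illustration and is not created elsewhere in this topic.

```sql
-- Hypothetical resolved form of the example above for a data timestamp of 20240101.
CREATE TABLE IF NOT EXISTS userinfo_new_20240101 (
    ip  STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
);

-- Writing one partition of the generated table from a placeholder source table.
INSERT OVERWRITE TABLE userinfo_new_20240101 PARTITION (dt = '20240101')
SELECT ip, uid
FROM access_log_source
WHERE dt = '20240101';
```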
An SQL statement cannot exceed 130 KB.
If multiple EMR computing resources are attached to your workspace in Data Development, you must select a computing resource.
(Optional) Configure advanced parameters
On the Advanced Settings tab of the node, configure Spark-specific attribute parameters. For more information about the Spark attribute parameters, see Spark Configuration. The configurable advanced parameters vary based on the EMR cluster type, as shown in the following tables.
DataLake cluster/Custom cluster: EMR on ECS
| Advanced parameter | Configuration description |
| --- | --- |
| queue | The scheduling queue to which the job is submitted. The default queue is `default`. For more information about EMR YARN, see Basic queue configurations. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| ENABLE_SPARKSQL_JDBC | The method for submitting SQL code. Valid values: … |
| DATAWORKS_SESSION_DISABLE | Applicable to scenarios where you directly test and run tasks in the development environment. Valid values: … Note: When this parameter is set to … |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … |
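Separately from the Advanced Settings tab, Spark SQL itself also accepts SET statements inside the node code for SQL-level runtime settings. This is generic Spark SQL behavior rather than a DataWorks-specific mechanism; the property values below are illustrative only, and the query reuses the table name from the earlier example.

```sql
-- Generic Spark SQL SET statements; values are illustrative, not recommendations.
SET spark.sql.shuffle.partitions=200;
SET spark.sql.adaptive.enabled=true;

-- Subsequent statements in the same script run with the adjusted SQL configuration.
SELECT dt, COUNT(*) AS record_count
FROM userinfo_new_${var}
GROUP BY dt;
```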
EMR Serverless Spark cluster
For information about parameter settings, see Set parameters for submitting Spark jobs.
| Advanced parameter | Configuration description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| DATAWORKS_SESSION_DISABLE | The job submission method. When you execute a task in Data Development, the task is submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed by SQL Compute or submitted to a queue for execution. |
| SERVERLESS_RELEASE_VERSION | The Spark DPI engine version. By default, the Default DPI Engine Version configured for the cluster in Management Center > Cluster Management is used. To set different DPI engine versions for different tasks, you can set them here. Note: The … |
| SERVERLESS_QUEUE_NAME | Specifies the resource queue to which the task is submitted. When a task is specified to be submitted to a queue for execution, the Default Resource Queue configured for the cluster in Management Center > Cluster Management is used by default. If you have resource fencing and management requirements, you can add queues. For more information, see Manage resource queues. Configuration methods: … |
| SERVERLESS_SQL_COMPUTE | Specifies the SQL Compute (SQL session). By default, the Default SQL Compute configured for the cluster in Management Center > Cluster Management is used. To set different SQL sessions for different tasks, you can set them here. To create and manage SQL sessions, see Manage SQL sessions. |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … Note: DataWorks lets you set global Spark parameters, which means you can specify the Spark parameters used by each DataWorks module at the workspace level. You can specify whether the priority of these global Spark parameters is higher than that of the Spark parameters within a specific module. For more information about how to set global Spark parameters, see Set global Spark parameters. |
Save and run the SQL task
On the toolbar, click the Save icon to save the SQL statement. Then, click the Run icon to run the SQL statement.
In the Run dialog box, select a resource group that has passed the network connectivity test to ensure that DataWorks can access your Spark service. If you use variables in the node code, assign constants to the variables for testing purposes. For information about how to configure scheduling parameters and resource groups for scheduling, see Configure node scheduling. For more information about how to debug tasks, see Task debugging process.
To modify the parameter assignments in the code, click Advanced Run on the toolbar. For more information about the parameter assignment logic, see What is the difference in assignment logic among Run, Advanced Run, and smoke testing in the development environment?
Spark cluster: EMR on ACK
| Advanced parameter | Configuration description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … Note: DataWorks lets you set global Spark parameters, which means you can specify the Spark parameters used by each DataWorks module at the workspace level. You can specify whether the priority of these global Spark parameters is higher than that of the Spark parameters within a specific module. For more information about how to set global Spark parameters, see Set global Spark parameters. |
Hadoop cluster: EMR on ECS
| Advanced parameter | Configuration description |
| --- | --- |
| queue | The scheduling queue to which the job is submitted. The default queue is `default`. For more information about EMR YARN, see Basic queue configurations. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| USE_GATEWAY | Specifies whether to submit the job of this node through a Gateway cluster. Valid values: … Note: If you manually set this parameter to … |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … |
Execute the SQL task
On the toolbar, click the Run icon. In the Parameters dialog box, select the scheduling resource group that you created and click Run.
Note: To access computing resources in a public network or VPC network environment, you must use a scheduling resource group that has passed the connectivity test with the computing resource. For more information, see Network connectivity solutions.
To change the resource group for subsequent task executions, you can click the Run With Parameters icon and select the scheduling resource group to which you want to switch.
When you query data using an EMR Spark SQL node, the query can return a maximum of 10,000 records, and the total data size cannot exceed 10 MB.
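If a debugging query might exceed these limits, you can add an explicit LIMIT clause so that the preview stays within the display cap. The following is a minimal sketch that reuses the table name from the earlier example; the partition filter value is illustrative.

```sql
-- Keep ad hoc preview results well within the 10,000-record display cap.
SELECT ip, uid, dt
FROM userinfo_new_${var}
WHERE dt = '20240101'   -- illustrative partition value
LIMIT 1000;
```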
Click the Save icon to save the SQL statement.
(Optional) Perform a smoke test.
If you want to perform a smoke test in the development environment, you can do so when you submit the node or after the node is submitted. For more information, see Perform a smoke test.
3. Configure node scheduling
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must set the Rerun property and the Dependencies (ancestor nodes) for the node before you submit it.
4. Publish the node task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the Save icon in the top toolbar to save the task.
Click the Submit icon in the top toolbar to commit the task.
In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.
FAQ
When you execute an EMR Spark SQL node task in Data Development, if the task must be submitted to SQL Compute for execution, ensure that the status of the SQL Compute service is Running. Otherwise, the task fails. For information about how to view the SQL Compute status, see Manage SQL sessions.