Create an E-MapReduce (EMR) Spark SQL node to process structured data using a distributed SQL query engine. This improves job execution efficiency.
Prerequisites
If you need to customize the component environment before you develop nodes, you can create a custom image based on the official image `dataworks_emr_base_task_pod` and use the custom image in DataStudio. For example, you can replace Spark JAR packages or include specific libraries, files, or JAR packages when you create the custom image.
An EMR cluster is registered with DataWorks. For more information, see Legacy Data Development: Attach EMR computing resources.
If you use a Resource Access Management (RAM) user to develop tasks, the RAM user must be added to the workspace and granted the Developer or Workspace Administrator role. The Workspace Administrator role has extensive permissions and must be granted with caution. For more information about how to add members, see Add members to a workspace.
A resource group is purchased and configured. This includes attaching the resource group to a workspace and configuring the network. For more information, see Add and use a Serverless resource group.
A business flow is created. In DataStudio, development operations for different engines are organized within business flows. You must create a business flow before you can create a node. For more information, see Create a business flow.
If you require a specific development environment, you can use the custom image feature in DataWorks to build a component image for task execution. For more information, see Custom images.
Limits
This type of node can be run only on a serverless resource group or an exclusive resource group for scheduling. We recommend that you use a serverless resource group. If you need to use an image in DataStudio, use a serverless computing resource group.
To manage the metadata of DataLake or custom clusters in DataWorks, EMR-HOOK must be configured on the cluster. If EMR-HOOK is not configured, you cannot view metadata in real time, generate audit logs, view data lineage, or perform EMR-related administration tasks in DataWorks. For more information about how to configure EMR-HOOK, see Configure EMR-HOOK for Spark SQL.
Data lineage is not available for EMR on ACK Spark clusters. Data lineage is available for EMR Serverless Spark clusters.
You can use the visualization feature to register functions for DataLake clusters and custom clusters, but not for EMR on ACK Spark clusters or EMR Serverless Spark clusters.
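For cluster types where visualized registration is not available, functions can generally still be registered directly in the SQL code by using standard Spark SQL syntax. The following is a minimal sketch only; the function name, implementation class, and JAR path are hypothetical placeholders and are not taken from this topic.

```sql
-- Hypothetical example of registering a UDF with standard Spark SQL syntax.
-- Replace the function name, class name, and JAR location with your own values.
CREATE FUNCTION my_upper
AS 'com.example.udf.MyUpper'
USING JAR 'oss://my-bucket/udf/my-udf.jar';

-- Use the registered function in a query.
SELECT my_upper('hello');
```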
Precautions
If you have enabled Ranger permission control for the Spark component in the EMR cluster that is associated with your workspace, take note of the following items:
When you use the default image to run Spark tasks, Ranger permission control takes effect by default.
If you want to use a custom image to run Spark tasks, you must submit a ticket to contact technical support to upgrade the image. This way, Ranger permission control can take effect when you run Spark tasks.
1. Create an EMR Spark SQL node
Go to the DataStudio page.
Log on to the DataWorks console. In the top navigation bar, select the desired region. In the left-side navigation pane, choose . On the page that appears, select the desired workspace from the drop-down list and click Go to Data Development.
Create an EMR Spark SQL node.
Right-click the target business flow and choose the option to create an EMR Spark SQL node.
Note: Alternatively, you can hover over New and choose the same option.
In the New Node dialog box, set Name, Engine Instance, Node Type, and Path. Click Confirm. The configuration tab of the EMR Spark SQL node appears.
Note: The node name can contain uppercase letters, lowercase letters, Chinese characters, digits, underscores (_), and periods (.).
2. Develop an EMR Spark SQL task
Double-click the EMR Spark SQL node that you created to open the task development tab.
Develop SQL code
In the SQL editing area, develop the task code. You can define variables in the code using the ${variable_name} format. On the node configuration tab, you can assign values to the variables in the Scheduling Configuration > Scheduling Parameters section in the right-side navigation pane. This lets you dynamically pass parameters in scheduling scenarios. For more information about how to use scheduling parameters, see Supported formats of scheduling parameters. The following is an example.
```sql
SHOW TABLES;

-- Define a variable named var in the ${var} format. If you assign the value ${yyyymmdd} to this
-- variable, you can create a table whose name is suffixed with the data timestamp. This can be
-- used together with scheduling parameters.
CREATE TABLE IF NOT EXISTS userinfo_new_${var} (
    ip  STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
);
```
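For reference, the following sketch shows how the statements above might look after the variable is resolved on a hypothetical data timestamp of 20240101. The source table access_log_source is a placeholder used only for illustration and is not created elsewhere in this topic.

```sql
-- Hypothetical resolved form of the example above for a data timestamp of 20240101.
CREATE TABLE IF NOT EXISTS userinfo_new_20240101 (
    ip  STRING COMMENT 'IP address',
    uid STRING COMMENT 'User ID'
)
PARTITIONED BY (
    dt STRING
);

-- Writing one partition of the generated table from a placeholder source table.
INSERT OVERWRITE TABLE userinfo_new_20240101 PARTITION (dt = '20240101')
SELECT ip, uid
FROM access_log_source
WHERE dt = '20240101';
```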
An SQL statement cannot exceed 130 KB.
If multiple EMR computing resources are attached to your workspace in Data Development, you must select a computing resource.
(Optional) Configure advanced parameters
On the Advanced Settings tab of the node, configure Spark-specific attribute parameters. For more information about the Spark attribute parameters, see Spark Configuration. The configurable advanced parameters vary based on the EMR cluster type, as shown in the following tables.
DataLake cluster/Custom cluster: EMR on ECS
| Advanced parameter | Configuration description |
| --- | --- |
| queue | The scheduling queue to which the job is submitted. The default queue is `default`. For more information about EMR YARN, see Basic queue configurations. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| ENABLE_SPARKSQL_JDBC | The method for submitting SQL code. Valid values: … |
| DATAWORKS_SESSION_DISABLE | Applicable to scenarios where you directly test and run tasks in the development environment. Valid values: … Note: When this parameter is set to … |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … |
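Separately from the Advanced Settings tab, Spark SQL itself also accepts SET statements inside the node code for SQL-level runtime settings. This is generic Spark SQL behavior rather than a DataWorks-specific mechanism; the property values below are illustrative only, and the query reuses the table name from the earlier example.

```sql
-- Generic Spark SQL SET statements; values are illustrative, not recommendations.
SET spark.sql.shuffle.partitions=200;
SET spark.sql.adaptive.enabled=true;

-- Subsequent statements in the same script run with the adjusted SQL configuration.
SELECT dt, COUNT(*) AS record_count
FROM userinfo_new_${var}
GROUP BY dt;
```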
EMR Serverless Spark cluster
For information about parameter settings, see Set parameters for submitting Spark jobs.
| Advanced parameter | Configuration description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| DATAWORKS_SESSION_DISABLE | The job submission method. When you execute a task in Data Development, the task is submitted to SQL Compute for execution by default. You can use this parameter to specify whether the task is executed by SQL Compute or submitted to a queue for execution. |
| SERVERLESS_RELEASE_VERSION | The Spark DPI engine version. By default, the Default DPI Engine Version configured for the cluster in Management Center > Cluster Management is used. To set different DPI engine versions for different tasks, you can set them here. Note: The … |
| SERVERLESS_QUEUE_NAME | Specifies the resource queue to which the task is submitted. When a task is specified to be submitted to a queue for execution, the Default Resource Queue configured for the cluster in Management Center > Cluster Management is used by default. If you have resource fencing and management requirements, you can add queues. For more information, see Manage resource queues. Configuration methods: … |
| SERVERLESS_SQL_COMPUTE | Specifies the SQL Compute (SQL session). By default, the Default SQL Compute configured for the cluster in Management Center > Cluster Management is used. To set different SQL sessions for different tasks, you can set them here. To create and manage SQL sessions, see Manage SQL sessions. |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … Note: DataWorks lets you set global Spark parameters, which means you can specify the Spark parameters used by each DataWorks module at the workspace level. You can specify whether the priority of these global Spark parameters is higher than that of the Spark parameters within a specific module. For more information about how to set global Spark parameters, see Set global Spark parameters. |
Save and run the SQL task
On the toolbar, click the Save icon to save the SQL statement. Then, click the Run icon to run the SQL statement.
In the Run dialog box, select a resource group that has passed the network connectivity test to ensure that DataWorks can access your Spark service. If you use variables in the node code, assign constants to the variables for testing purposes. For information about how to configure scheduling parameters and resource groups for scheduling, see Configure node scheduling. For more information about how to debug tasks, see Task debugging process.
To modify the parameter assignments in the code, click Advanced Run on the toolbar. For more information about the parameter assignment logic, see What is the difference in assignment logic among Run, Advanced Run, and smoke testing in the development environment?
Spark cluster: EMR on ACK
| Advanced parameter | Configuration description |
| --- | --- |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … Note: DataWorks lets you set global Spark parameters, which means you can specify the Spark parameters used by each DataWorks module at the workspace level. You can specify whether the priority of these global Spark parameters is higher than that of the Spark parameters within a specific module. For more information about how to set global Spark parameters, see Set global Spark parameters. |
Hadoop cluster: EMR on ECS
| Advanced parameter | Configuration description |
| --- | --- |
| queue | The scheduling queue to which the job is submitted. The default queue is `default`. For more information about EMR YARN, see Basic queue configurations. |
| priority | The priority. The default value is 1. |
| FLOW_SKIP_SQL_ANALYZE | The execution mode of SQL statements. Valid values: … Note: This parameter can be used only for testing and running flows in the data development environment. |
| USE_GATEWAY | Specifies whether to submit the job of this node through a Gateway cluster. Valid values: … Note: If you manually set this parameter to … |
| Other | Custom Spark Configuration parameters. Add Spark-specific attribute parameters. The configuration format is as follows: … |
Execute the SQL task
On the toolbar, click the Run icon. In the Parameters dialog box, select the scheduling resource group that you created and click Run.
Note: To access computing resources in a public network or VPC network environment, you must use a scheduling resource group that has passed the connectivity test with the computing resource. For more information, see Network connectivity solutions.
To change the resource group for subsequent task executions, you can click the Run With Parameters icon and select the scheduling resource group to which you want to switch.
When you query data using an EMR Spark SQL node, the query can return a maximum of 10,000 records, and the total data size cannot exceed 10 MB.
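If a debugging query might exceed these limits, you can add an explicit LIMIT clause so that the preview stays within the display cap. The following is a minimal sketch that reuses the table name from the earlier example; the partition filter value is illustrative.

```sql
-- Keep ad hoc preview results well within the 10,000-record display cap.
SELECT ip, uid, dt
FROM userinfo_new_${var}
WHERE dt = '20240101'   -- illustrative partition value
LIMIT 1000;
```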
Click the Save icon to save the SQL statement.
(Optional) Perform a smoke test.
If you want to perform a smoke test in the development environment, you can do so when you submit the node or after the node is submitted. For more information, see Perform a smoke test.
3. Configure node scheduling
If you want the system to periodically run a task on the node, you can click Properties in the right-side navigation pane on the configuration tab of the node to configure task scheduling properties based on your business requirements. For more information, see Overview.
You must set the Rerun property and the Dependencies (ancestor nodes) for the node before you submit it.
4. Publish the node task
After a task on a node is configured, you must commit and deploy the task. After you commit and deploy the task, the system runs the task on a regular basis based on scheduling configurations.
Click the Save icon in the top toolbar to save the task.
Click the Submit icon in the top toolbar to commit the task.
In the Submit dialog box, configure the Change description parameter. Then, determine whether to review task code after you commit the task based on your business requirements.
Note: You must configure the Rerun and Parent Nodes parameters on the Properties tab before you commit the task.
You can use the code review feature to ensure the code quality of tasks and prevent task execution errors caused by invalid task code. If you enable the code review feature, the task code that is committed can be deployed only after the task code passes the code review. For more information, see Code review.
If you use a workspace in standard mode, you must deploy the task in the production environment after you commit the task. To deploy a task on a node, click Deploy in the upper-right corner of the configuration tab of the node. For more information, see Deploy nodes.
What to do next
After you commit and deploy the task, the task is periodically run based on the scheduling configurations. You can click Operation Center in the upper-right corner of the configuration tab of the corresponding node to go to Operation Center and view the scheduling status of the task. For more information, see View and manage auto triggered tasks.
FAQ
When you execute an EMR Spark SQL node task in Data Development, if the task must be submitted to SQL Compute for execution, ensure that the status of the SQL Compute service is Running. Otherwise, the task fails. For information about how to view the SQL Compute status, see Manage SQL sessions.