DataWorks allows you to use Lindorm Spark nodes to develop and periodically schedule Lindorm Spark tasks. This topic describes how to use a Lindorm Spark node to develop a task.
Background information
Lindorm is a distributed computing service based on a cloud-native architecture. It is compatible with open source community computing models such as Apache Spark and is deeply integrated with the features provided by the Lindorm storage engine. Lindorm can use the underlying data storage and indexing capabilities to efficiently complete distributed jobs. Lindorm meets computing requirements in various scenarios, such as massive data processing, interactive analytics, machine learning, and graph computing.
Prerequisites
(Required if you use a RAM user) The RAM user that you want to use to develop tasks is added to the required workspace and is assigned the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, and we recommend that you assign this role to the RAM user only when necessary. For more information about how to add a member to a workspace and grant permissions to the member, see Add workspace members and assign roles to them.
Note If you use an Alibaba Cloud account, ignore this prerequisite.
A Lindorm instance is created and associated with the required workspace as a computing resource. For more information, see Add a Lindorm computing resource.
Configure the Lindorm Spark node
On the configuration tab of the Lindorm Spark node, you can configure the node by using a JAR package or a .py file, based on the language in which the task is developed (Java, Scala, or Python).
Configure the Lindorm Spark node in Java or Scala
In the following example, the sample program SparkPi, which estimates the value of pi (π) by Monte Carlo sampling, is used to describe how to configure and use a Lindorm Spark node.
Upload a JAR package
You must upload a sample JAR package to LindormDFS and copy the storage path of the JAR package so that you can reference the package when you configure the node.
Prepare a sample JAR package.
Download the sample JAR package spark-examples_2.12-3.3.0.jar to your computer.
Upload the JAR package to LindormDFS.
Log on to the Lindorm console. In the top navigation bar, select a desired region. Find the Lindorm instance that you created on the Instances page.
Click the name of the instance in the Instance ID/Name column to go to the instance details page.
In the left-side navigation pane, click Compute Engine.
On the Job Management tab of the Compute Engine page, click Upload Resource.
Click the dashed upload box. In the dialog box that appears, select the JAR package that you downloaded and click Open.
Click Upload.
Copy the storage path of the sample JAR package.
On the Job Management tab, find the JAR package that you uploaded in the resource list below Upload Resource. Click the copy icon to the left of the package to copy the storage path of the JAR package in LindormDFS.
Configure the Lindorm Spark node
Configure the Lindorm Spark node based on the parameters that are described in the following table.
Language | Parameter | Description
Java or Scala | Main JAR Resource | The storage path of the sample JAR package that you copied in the Upload a JAR package section.
Java or Scala | Main Class | The main class of the task in the compiled JAR package. The name of the main class in the sample code is org.apache.spark.examples.SparkPi.
Java or Scala | Parameters | The parameters that you want to pass to the code. You can specify dynamic parameters in the ${var} format.
Java or Scala | Configuration Items | The runtime parameters of the Spark program, as shown in the example after this table. For more information about Spark property settings, see Job configuration instructions.
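The following shows one possible way to fill in these parameters for the SparkPi sample. The values are illustrative only: use the actual path that you copied in the Upload a JAR package section, and note that whether a given Spark property takes effect depends on the Lindorm Spark engine; see Job configuration instructions for the supported properties.
Main JAR Resource: <the storage path that you copied in the Upload a JAR package section>
Main Class: org.apache.spark.examples.SparkPi
Parameters: 1000 (SparkPi reads its first argument as the number of partitions used for sampling)
Configuration Items:
spark.executor.instances=2
spark.executor.cores=2
spark.executor.memory=4g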
Configure the Lindorm Spark node in Python
In the following example, a sample Spark program that calculates the value of pi (π) is used to describe how to configure and use a Lindorm Spark node.
Upload the Python resource
You must upload the sample Python resource to LindormDFS and copy the storage path of the resource so that you can reference it when you configure the node.
Create a Python resource.
Save the following Python script as a local file named pi.py.
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    # Create or obtain the SparkSession for this application.
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    # The first command-line argument, if provided, sets the number of partitions.
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_: int) -> float:
        # Sample a random point in the square [-1, 1] x [-1, 1] and return 1
        # if it falls inside the unit circle, 0 otherwise.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # The fraction of sampled points inside the circle approximates pi/4.
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
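If PySpark is installed on your computer, you can optionally verify the script locally before you upload it. For example, the command spark-submit pi.py 10 runs the script with 10 partitions and prints an estimate of pi. This check assumes a local Spark installation and is not a required step.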
Upload the Python resource to LindormDFS.
Log on to the Lindorm console. In the top navigation bar, select a desired region. Find the Lindorm instance that you created on the Instances page.
Click the name of the instance in the Instance ID/Name column to go to the instance details page.
In the left-side navigation pane, click Compute Engine.
On the Job Management tab of the Compute Engine page, click Upload Resource.
Click the dashed upload box. In the dialog box that appears, select the pi.py file that you saved and click Open.
Click Upload.
Copy the storage path of the sample Python resource.
On the Job Management tab, find the Python resource that you uploaded in the resource list below Upload Resource. Click the copy icon to the left of the resource to copy the storage path of the resource in LindormDFS.
Configure the Lindorm Spark node
Configure the Lindorm Spark node based on the parameters that are described in the following table.
Language | Parameter | Description
Python | Main Package | The storage path of the sample code file that you copied in the Upload the Python resource section.
Python | Parameters | The parameters that you want to pass to the code. You can specify dynamic parameters in the ${var} format.
Python | Configuration Items | The runtime parameters of the Spark program, as shown in the example after this table. For more information about Spark property settings, see Job configuration instructions.
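As an illustration, the node for the pi.py sample could be configured as follows. The parameter value is only an example: pi.py reads it as sys.argv[1] and uses it as the number of partitions across which sample points are generated.
Main Package: <the storage path that you copied in the Upload the Python resource section>
Parameters: 10
The Configuration Items shown in the Java or Scala example, such as spark.executor.instances, can be used here in the same way.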
Debug the Lindorm Spark node
Configure debugging properties for the Lindorm Spark node.
On the Debugging Configurations tab in the right-side navigation pane of the configuration tab of the Lindorm Spark node, configure the parameters that are described in the following table.
Parameter | Description
Computing Resource | Select the Lindorm computing resource that you associated with the workspace.
Lindorm Resource Group | Select the Lindorm resource group that you specified when you associated the Lindorm computing resource with the workspace.
Resource Group | Select the resource group that passed the connectivity test when you associated the Lindorm computing resource with the workspace.
Script Parameters | If you define variables in the ${Parameter name} format when you configure the Lindorm Spark node, you must configure the Parameter Name and Parameter Value parameters in the Script Parameters section, as shown in the example after this table. When the Lindorm Spark node is run, the variables are dynamically replaced with the actual values. For more information, see Supported formats of scheduling parameters.
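The following is a minimal illustration of script parameters; the variable name partitions is hypothetical and used only for this example. If you set the Parameters field of the node to ${partitions} and add a script parameter whose Parameter Name is partitions and whose Parameter Value is 10, the node replaces ${partitions} with 10 at runtime, and the pi.py sample then samples points across 10 partitions. The value can also be a scheduling parameter expression so that it resolves to a different value in each scheduling cycle; see Supported formats of scheduling parameters.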
Debug and run the Lindorm Spark node.
Save and run the node.
What to do next
Node scheduling: If you want the system to periodically schedule a node in a workspace directory, you can click Properties in the right-side navigation pane of the configuration tab of the node and configure scheduling properties for the node in the Scheduling Policies section.
Node deployment: If you want to deploy a node to the production environment for running, you can click the deployment icon in the top toolbar of the configuration tab of the node to initiate a deployment process. Nodes in a workspace directory can be periodically scheduled only after they are deployed to the production environment.