DataWorks allows you to use Lindorm Spark nodes to develop and periodically schedule Lindorm Spark tasks. This topic describes how to use a Lindorm Spark node to develop a task.
Background information
Lindorm is a distributed computing service based on a cloud-native architecture. It is compatible with open source community computing models such as Apache Spark and is deeply integrated with the features provided by the Lindorm storage engine. Lindorm can use the underlying data storage and indexing capabilities to efficiently complete distributed jobs. Lindorm meets computing requirements in various scenarios, such as massive data processing, interactive analytics, machine learning, and graph computing.
Prerequisites
(Required if you use a RAM user) The RAM user that you want to use to develop tasks is added to the required workspace and is assigned the Development or Workspace Administrator role. The Workspace Administrator role has extensive permissions, and we recommend that you assign this role to the RAM user only when necessary. For more information about how to add a member to a workspace and grant permissions to the member, see Add workspace members and assign roles to them.
Note If you use an Alibaba Cloud account, ignore this prerequisite.
A Lindorm instance is created and associated with the required workspace as a computing resource. For more information, see Add a Lindorm computing resource.
Configure the Lindorm Spark node
On the configuration tab of the Lindorm Spark node, you can configure the node by using a JAR package or a .py file, based on the language in which the task is developed (Java, Scala, or Python).
Configure the Lindorm Spark node in Java or Scala
In the following example, the sample program SparkPi, which estimates the value of pi (π) by Monte Carlo sampling, is used to describe how to configure and use a Lindorm Spark node.
Upload a JAR package
You must upload a sample JAR package to LindormDFS and copy the storage path of the JAR package so that you can reference the package when you configure the node.
Prepare a sample JAR package.
Download the sample JAR package spark-examples_2.12-3.3.0.jar to your computer.
Upload the JAR package to LindormDFS.
Log on to the Lindorm console. In the top navigation bar, select a desired region. Find the Lindorm instance that you created on the Instances page.
Click the name of the instance in the Instance ID/Name column to go to the instance details page.
In the left-side navigation pane, click Compute Engine.
On the Job Management tab of the Compute Engine page, click Upload Resource.
Click the dashed upload box. In the dialog box that appears, select the JAR package that you downloaded and click Open.
Click Upload.
Copy the storage path of the sample JAR package.
On the Job Management tab, find the JAR package that you uploaded in the resource list below Upload Resource. Click the copy icon to the left of the package to copy the storage path of the JAR package in LindormDFS.
Configure the Lindorm Spark node
Configure the Lindorm Spark node based on the parameters that are described in the following table.
Language | Parameter | Description
Java or Scala | Main JAR Resource | The storage path of the sample JAR package that you copied in the Upload a JAR package section.
Java or Scala | Main Class | The main class of the task in the compiled JAR package. The name of the main class in the sample code is org.apache.spark.examples.SparkPi.
Java or Scala | Parameters | The parameters that you want to pass to the code. You can specify dynamic parameters in the ${var} format.
Java or Scala | Configuration Items | The runtime parameters of the Spark program, as shown in the example after this table. For more information about Spark property settings, see Job configuration instructions.
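The following shows one possible way to fill in these parameters for the SparkPi sample. The values are illustrative only: use the actual path that you copied in the Upload a JAR package section, and note that whether a given Spark property takes effect depends on the Lindorm Spark engine; see Job configuration instructions for the supported properties.
Main JAR Resource: <the storage path that you copied in the Upload a JAR package section>
Main Class: org.apache.spark.examples.SparkPi
Parameters: 1000 (SparkPi reads its first argument as the number of partitions used for sampling)
Configuration Items:
spark.executor.instances=2
spark.executor.cores=2
spark.executor.memory=4g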
Configure the Lindorm Spark node in Python
In the following example, a sample Spark program that calculates the value of pi (π) is used to describe how to configure and use a Lindorm Spark node.
Upload the Python resource
You must upload the sample Python resource to LindormDFS and copy the storage path of the resource so that you can reference it when you configure the node.
Create a Python resource.
Save the following Python script as a local file named pi.py.
import sys
from random import random
from operator import add

from pyspark.sql import SparkSession

if __name__ == "__main__":
    """
    Usage: pi [partitions]
    """
    # Create or obtain the SparkSession for this application.
    spark = SparkSession \
        .builder \
        .appName("PythonPi") \
        .getOrCreate()

    # The first command-line argument, if provided, sets the number of partitions.
    partitions = int(sys.argv[1]) if len(sys.argv) > 1 else 2
    n = 100000 * partitions

    def f(_: int) -> float:
        # Sample a random point in the square [-1, 1] x [-1, 1] and return 1
        # if it falls inside the unit circle, 0 otherwise.
        x = random() * 2 - 1
        y = random() * 2 - 1
        return 1 if x ** 2 + y ** 2 <= 1 else 0

    # The fraction of sampled points inside the circle approximates pi/4.
    count = spark.sparkContext.parallelize(range(1, n + 1), partitions).map(f).reduce(add)
    print("Pi is roughly %f" % (4.0 * count / n))

    spark.stop()
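If PySpark is installed on your computer, you can optionally verify the script locally before you upload it. For example, the command spark-submit pi.py 10 runs the script with 10 partitions and prints an estimate of pi. This check assumes a local Spark installation and is not a required step.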
Upload the Python resource to LindormDFS.
Log on to the Lindorm console. In the top navigation bar, select a desired region. Find the Lindorm instance that you created on the Instances page.
Click the name of the instance in the Instance ID/Name column to go to the instance details page.
In the left-side navigation pane, click Compute Engine.
On the Job Management tab of the Compute Engine page, click Upload Resource.
Click the dashed upload box. In the dialog box that appears, select the pi.py file that you saved and click Open.
Click Upload.
Copy the storage path of the sample Python resource.
On the Job Management tab, find the Python resource that you uploaded in the resource list below Upload Resource. Click the copy icon to the left of the resource to copy the storage path of the resource in LindormDFS.
Configure the Lindorm Spark node
Configure the Lindorm Spark node based on the parameters that are described in the following table.
Language | Parameter | Description
Python | Main Package | The storage path of the sample code file that you copied in the Upload the Python resource section.
Python | Parameters | The parameters that you want to pass to the code. You can specify dynamic parameters in the ${var} format.
Python | Configuration Items | The runtime parameters of the Spark program, as shown in the example after this table. For more information about Spark property settings, see Job configuration instructions.
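As an illustration, the node for the pi.py sample could be configured as follows. The parameter value is only an example: pi.py reads it as sys.argv[1] and uses it as the number of partitions across which sample points are generated.
Main Package: <the storage path that you copied in the Upload the Python resource section>
Parameters: 10
The Configuration Items shown in the Java or Scala example, such as spark.executor.instances, can be used here in the same way.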
Debug the Lindorm Spark node
Configure debugging properties for the Lindorm Spark node.
On the Debugging Configurations tab in the right-side navigation pane of the configuration tab of the Lindorm Spark node, configure the parameters that are described in the following table.
Parameter | Description
Computing Resource | Select the Lindorm computing resource that you associated with the workspace.
Lindorm Resource Group | Select the Lindorm resource group that you specified when you associated the Lindorm computing resource with the workspace.
Resource Group | Select the resource group that passed the connectivity test when you associated the Lindorm computing resource with the workspace.
Script Parameters | If you define variables in the ${Parameter name} format when you configure the Lindorm Spark node, you must configure the Parameter Name and Parameter Value parameters in the Script Parameters section, as shown in the example after this table. When the Lindorm Spark node is run, the variables are dynamically replaced with the actual values. For more information, see Supported formats of scheduling parameters.
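The following is a minimal illustration of script parameters; the variable name partitions is hypothetical and used only for this example. If you set the Parameters field of the node to ${partitions} and add a script parameter whose Parameter Name is partitions and whose Parameter Value is 10, the node replaces ${partitions} with 10 at runtime, and the pi.py sample then samples points across 10 partitions. The value can also be a scheduling parameter expression so that it resolves to a different value in each scheduling cycle; see Supported formats of scheduling parameters.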
Debug and run the Lindorm Spark node.
Save and run the node.
What to do next
Node scheduling: If you want the system to periodically schedule a node in a workspace directory, you can click Properties in the right-side navigation pane of the configuration tab of the node and configure scheduling properties for the node in the Scheduling Policies section.
Node deployment: If you want to deploy a node to the production environment for running, you can click the deployment icon in the top toolbar of the configuration tab of the node to initiate a deployment process. Nodes in a workspace directory can be periodically scheduled only after they are deployed to the production environment.