- A Scala tool that helps you deploy an Apache Spark standalone cluster on EC2 and submit jobs to it.
- Currently supports Spark 2.0.0+.
- There are two modes for using spark-deployer: sbt plugin mode and embedded mode.
Here are the basic steps to run a Spark job (all the sbt commands support TAB-completion):
- Set the environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
- Prepare a project with a structure like the one below:
```
project-root
├── build.sbt
├── project
│   └── plugins.sbt
└── src
    └── main
        └── scala
            └── mypackage
                └── Main.scala
```
- Add one line in `project/plugins.sbt`:
```scala
addSbtPlugin("net.pishen" % "spark-deployer-sbt" % "3.0.2")
```
- Write your Spark project's `build.sbt` (here we give a simple example):
```scala
name := "my-project-name"

scalaVersion := "2.11.8"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.0.0" % "provided"
)
```
- Write your job's algorithm in `src/main/scala/mypackage/Main.scala`:
```scala
package mypackage

import org.apache.spark._

object Main {
  def main(args: Array[String]): Unit = {
    // setup spark
    val sc = new SparkContext(new SparkConf())

    // your algorithm
    val n = 10000000
    val count = sc.parallelize(1 to n).map { i =>
      val x = scala.math.random
      val y = scala.math.random
      if (x * x + y * y < 1) 1 else 0
    }.reduce(_ + _)

    println("Pi is roughly " + 4.0 * count / n)
  }
}
```
- Enter `sbt`, and build a config by:
```
> sparkBuildConfig
```
(Most settings have default values; just hit Enter to step through them.)
- Create a cluster with 1 master and 2 workers by:
```
> sparkCreateCluster 2
```
- See your cluster's status by:
```
> sparkShowMachines
```
- Submit your job by:
```
> sparkSubmit
```
- When your job is done, destroy your cluster with:
```
> sparkDestroyCluster
```
- To build a config with a different name, or to build a new config based on an old one:
```
> sparkBuildConfig <new-config-name>
> sparkBuildConfig <new-config-name> from <old-config-name>
```
All the configs are stored as `.deployer.json` files in the `conf/` folder. You can modify them if you know what you're doing (an illustrative sketch of such a file follows this list).
- To change the current config:
```
> sparkChangeConfig <config-name>
```
- To submit a job with arguments or with a specific main class:
```
> sparkSubmit <args>
> sparkSubmitMain mypackage.Main <args>
```
- To add or remove worker machines dynamically:
```
> sparkAddWorkers <num-of-workers>
> sparkRemoveWorkers <num-of-workers>
```
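For orientation only, a config file looks roughly like the sketch below. The key names here are illustrative guesses (apart from `preStartCommands`, which is described later); treat the file generated by `sparkBuildConfig` as the authoritative reference:
```json
{
  "clusterName": "my-cluster",
  "keypair": "my-keypair",
  "region": "us-west-2",
  "master": { "instanceType": "m4.large" },
  "workers": { "instanceType": "m4.large" },
  "preStartCommands": []
}
```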
If you don't want to use sbt, or if you would like to trigger the cluster creation from within your Scala application, you can include the spark-deployer library directly:
```scala
libraryDependencies += "net.pishen" %% "spark-deployer-core" % "3.0.2"
```
Then, from your Scala code, you can do something like this:
```scala
import java.io.File

import sparkdeployer._

// build a ClusterConf
val clusterConf = ClusterConf.build()

// save and load ClusterConf
clusterConf.save("path/to/conf.deployer.json")
val clusterConfReloaded = ClusterConf.load("path/to/conf.deployer.json")

// create cluster and submit job
val sparkDeployer = new SparkDeployer()(clusterConf)

val workers = 2
sparkDeployer.createCluster(workers)

val jar = new File("path/to/job.jar")
val mainClass = "mypackage.Main"
val args = Seq("arg0", "arg1")
sparkDeployer.submit(jar, mainClass, args)

sparkDeployer.destroyCluster()
```
- The environment variables `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY` should also be set.
- You may prepare the `job.jar` with sbt-assembly from another sbt project that uses Spark (see the sketch after this list).
- For other available functions, check `SparkDeployer.scala` in our source code.
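As a sketch of that sbt-assembly setup (the plugin version below is illustrative, not something spark-deployer prescribes), add the plugin to the job project's `project/plugins.sbt`:
```scala
// project/plugins.sbt of the job project (version is illustrative)
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```
Then run `sbt assembly` in that project to produce the fat jar (by default it lands under `target/scala-2.11/`), and pass its path to `sparkDeployer.submit` as shown above. Keeping `spark-core` as `"provided"` avoids bundling Spark itself into the jar.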
spark-deployer uses slf4j; remember to add your own logging backend to see the logs. For example, to print the logs to the console, add:
```scala
libraryDependencies += "org.slf4j" % "slf4j-simple" % "1.7.14"
```
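If you prefer logback as the backend instead (just an alternative suggestion, not something spark-deployer requires), the equivalent dependency would be:
```scala
// logback-classic is an slf4j backend that logs to the console by default
libraryDependencies += "ch.qos.logback" % "logback-classic" % "1.1.7"
```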
Yes, just specify the AMI id when running `sparkBuildConfig`. The image should be HVM EBS-backed with Java 7+ installed. You can also run some commands on each machine before Spark starts by editing `preStartCommands` in the json config. For example:
```json
"preStartCommands": [
  "sudo bash -c \"echo -e 'LC_ALL=en_US.UTF-8\\nLANG=en_US.UTF-8' >> /etc/environment\"",
  "sudo apt-get -qq install openjdk-8-jre",
  "cd spark/conf/ && cp log4j.properties.template log4j.properties && echo 'log4j.rootCategory=WARN, console' >> log4j.properties"
]
```
When using a custom AMI, the root device should be set to your root volume's name (`/dev/sda1` for Ubuntu), so that it can be enlarged by the disk size settings of master and workers.
Yes, just change the tgz url when running `sparkBuildConfig`; the tgz will be extracted into a `spark/` folder in each machine's home directory.
Assuming your security group id is `sg-abcde123`, the basic settings are:

| Type | Protocol | Port Range | Source |
|---|---|---|---|
| All traffic | All | All | sg-abcde123 |
| SSH | TCP | 22 | <your-allowed-ip> |
| Custom TCP Rule | TCP | 8080-8081 | <your-allowed-ip> |
| Custom TCP Rule | TCP | 4040 | <your-allowed-ip> |
Change to the config you want to upgrade and run `sparkUpgradeConfig` to build a new config based on the settings in the old one. If this doesn't work, or if you don't mind rebuilding from scratch, it's recommended to create a new config directly with `sparkBuildConfig`.
You can change it by adding the following line to your `build.sbt`:
```scala
sparkConfigDir := "path/to/my-config-dir"
```
- Please report an issue or ask on Gitter if you run into any problem.
- Pull requests are welcome.