
Apache Spark cluster manager types
As discussed previously, Apache Spark currently supports three cluster managers:
- Standalone cluster manager
- Apache Mesos
- Hadoop YARN
We'll look at setting these up in much more detail in Chapter 8, Operating in Clustered Mode.
Building standalone applications with Apache Spark
Until now, we have used Spark for exploratory analysis through the Scala and Python shells. Spark can also be used in standalone applications written in Java, Scala, Python, or R. As we saw earlier, the Spark shell and PySpark provide you with a SparkContext; however, when you write your own application, you need to initialize a SparkContext yourself. Once you have a SparkContext reference, the rest of the API is exactly the same as for interactive query analysis. After all, it's the same object, just running in a different context.
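For instance, a bare-bones standalone Scala application might look like the following sketch (the application name, master URL, and input file are illustrative placeholders, not values prescribed here):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object WordCountApp {
  def main(args: Array[String]): Unit = {
    // Build a configuration; in practice the master URL is usually
    // supplied via spark-submit rather than hard-coded.
    val conf = new SparkConf()
      .setAppName("WordCountApp")
      .setMaster("local[2]") // illustrative only

    // Initialize our own SparkContext instead of relying on the shell's.
    val sc = new SparkContext(conf)

    // From here on, the API is the same as in the interactive shells.
    val counts = sc.textFile("input.txt")
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)

    sc.stop()
  }
}
```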
The exact method of using Spark with your application differs based on your language of choice. All Spark artifacts are hosted in Maven Central. You can add a Maven dependency with the following coordinates:
groupId: org.apache.spark
artifactId: spark-core_2.10
version: 1.6.1
You can use Maven to build the project or, alternatively, use the Scala IDE for Eclipse to add a Maven dependency to your project.
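If you build with sbt instead, the same coordinates can be declared in a build.sbt along these lines (the Scala version shown is only an example that matches the 2.10 artifact):

```scala
// build.sbt -- equivalent of the Maven coordinates above
scalaVersion := "2.10.6"

libraryDependencies += "org.apache.spark" % "spark-core_2.10" % "1.6.1"
// or, letting sbt append the Scala binary version automatically:
// libraryDependencies += "org.apache.spark" %% "spark-core" % "1.6.1"
```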
Note
Apache Maven is a build automation tool used primarily for Java projects. The word maven means "accumulator of knowledge" in Yiddish. Maven addresses the two core aspects of building software: first, it describes how the software is built and second, it describes its dependencies.
You can configure your IDEs to work with Spark. While many Spark developers use SBT or Maven on the command line, the most commonly used IDE is IntelliJ IDEA. The Community Edition is free, and you can then install the JetBrains Scala plugin. You can find detailed instructions on setting up either IntelliJ IDEA or Eclipse to build Spark at http://bit.ly/28RDPFy.
Submitting applications
The spark-submit script in Spark's bin directory is the most commonly used method of submitting Spark applications to a cluster, and it can be used to launch applications on all supported cluster types. You will need to package your application together with all of its dependencies so that Spark can distribute them across the cluster; in other words, you need to create an assembly JAR (also known as an uber or fat JAR) containing your code and its relevant dependencies.
A Spark application with its dependencies can be launched using the bin/spark-submit script. This script takes care of setting up the classpath and its dependencies, and it supports all the cluster managers and deploy modes supported by Spark.
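One common way to produce such an assembly JAR in an sbt project is the sbt-assembly plugin; the snippet below is an illustrative sketch rather than a required setup:

```scala
// project/plugins.sbt -- enable the sbt-assembly plugin
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
```

Running sbt assembly then produces a single fat JAR under the target directory that can be handed to spark-submit. The Spark dependency itself is usually marked as provided so that it is not bundled, because the cluster already supplies it.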

Figure 1.16: Spark submission template
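In its general form, a spark-submit invocation looks like the following, where the angle-bracketed items are placeholders you fill in for your own application:

```
./bin/spark-submit \
  --class <main-class> \
  --master <master-url> \
  --deploy-mode <deploy-mode> \
  --conf <key>=<value> \
  ... # other options
  <application-jar> \
  [application-arguments]
```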
For Python applications:
- Instead of <application-jar>, simply pass in your .py file.
- Add Python .zip, .egg, and .py files to the search path with --py-files.
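For example, a Python application with bundled dependencies might be submitted as follows (the file names are placeholders):

```
./bin/spark-submit \
  --master <master-url> \
  --py-files deps.zip \
  my_script.py
```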
Deployment strategies
- Client mode: This is commonly used when your application is located close to your cluster. In this mode, the driver is launched as part of the spark-submit process, which acts as a client to the cluster. The input and output of the application are passed on to the console. This mode is suitable when your gateway machine is physically co-located with your worker machines, and it is used by applications such as the Spark shell that involve a REPL. This is the default mode.
- Cluster mode: This is useful when your application is submitted from a machine far away from the worker machines (for example, locally on your laptop); here the driver is launched inside the cluster, which minimizes network latency between the driver and the executors. Currently, only YARN supports cluster mode for Python applications. The following table shows the combinations of cluster managers, deploy modes, and usage that are not supported in Spark 2.0.0:
