PLEASE NOTE: This document applies to the HEAD of the source tree

If you are using a released version of Kubernetes, you should refer to the docs that go with that version.

The latest 1.0.x release of this document can be found [here](http://releases.k8s.io/release-1.0/examples/spark/README.md).

Documentation for other releases can be found at releases.k8s.io.

Spark example

Following this example, you will create a functional Apache Spark cluster using Kubernetes and Docker.

You will set up a Spark master service and a set of Spark workers using Spark's standalone mode.

For the impatient expert, jump straight to the tl;dr section.

Sources

The Docker images are heavily based on https://github.com/mattf/docker-spark

Step Zero: Prerequisites

This example assumes you have a Kubernetes cluster installed and running, and that you have installed the kubectl command line tool somewhere in your path. Please see the getting started guides for installation instructions for your platform.

Step One: Start your Master service

The Master service coordinates the Spark cluster: workers register with it, and it allocates their resources to the applications you run.

Use the examples/spark/spark-master-controller.yaml file to create a replication controller running the Spark Master service.

$ kubectl create -f examples/spark/spark-master-controller.yaml
replicationcontrollers/spark-master-controller

Then, use the examples/spark/spark-master-service.yaml file to create a logical service endpoint that Spark workers can use to access the Master pod.

$ kubectl create -f examples/spark/spark-master-service.yaml
services/spark-master

You can then create a service for the Spark Master WebUI:

$ kubectl create -f examples/spark/spark-webui.yaml
services/spark-webui

Check to see if Master is running and accessible

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          8m

Check logs to see the status of the master. (Use the pod retrieved from the previous output.)

$ kubectl logs spark-master-controller-5u0q5
starting org.apache.spark.deploy.master.Master, logging to /opt/spark-1.5.1-bin-hadoop2.6/sbin/../logs/spark--org.apache.spark.deploy.master.Master-1-spark-master-controller-g0oao.out
Spark Command: /usr/lib/jvm/java-8-openjdk-amd64/jre/bin/java -cp /opt/spark-1.5.1-bin-hadoop2.6/sbin/../conf/:/opt/spark-1.5.1-bin-hadoop2.6/lib/spark-assembly-1.5.1-hadoop2.6.0.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-rdbms-3.2.9.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-core-3.2.10.jar:/opt/spark-1.5.1-bin-hadoop2.6/lib/datanucleus-api-jdo-3.2.6.jar -Xms1g -Xmx1g org.apache.spark.deploy.master.Master --ip spark-master --port 7077 --webui-port 8080
========================================
15/10/27 21:25:05 INFO Master: Registered signal handlers for [TERM, HUP, INT]
15/10/27 21:25:05 INFO SecurityManager: Changing view acls to: root
15/10/27 21:25:05 INFO SecurityManager: Changing modify acls to: root
15/10/27 21:25:05 INFO SecurityManager: SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(root); users with modify permissions: Set(root)
15/10/27 21:25:06 INFO Slf4jLogger: Slf4jLogger started
15/10/27 21:25:06 INFO Remoting: Starting remoting
15/10/27 21:25:06 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@spark-master:7077]
15/10/27 21:25:06 INFO Utils: Successfully started service 'sparkMaster' on port 7077.
15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077
15/10/27 21:25:07 INFO Master: Running Spark version 1.5.1
15/10/27 21:25:07 INFO Utils: Successfully started service 'MasterUI' on port 8080.
15/10/27 21:25:07 INFO MasterWebUI: Started MasterWebUI at http://spark-master:8080
15/10/27 21:25:07 INFO Utils: Successfully started service on port 6066.
15/10/27 21:25:07 INFO StandaloneRestServer: Started REST server for submitting applications on port 6066
15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE
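If you would rather check the log programmatically than read it, the following is a minimal sketch. It assumes the log text has already been captured (for example from kubectl logs) and that the two message formats match the sample output above; the helper name master_status is hypothetical and not part of Spark or kubectl.

```python
import re

def master_status(log_text):
    """Return (master_url, is_alive) parsed from Spark master log text.

    Assumes the message formats shown in the sample log above.
    """
    url_match = re.search(r"Starting Spark master at (spark://\S+)", log_text)
    is_alive = "New state: ALIVE" in log_text
    return (url_match.group(1) if url_match else None, is_alive)

# Two lines copied from the sample log in this document.
sample_log = (
    "15/10/27 21:25:07 INFO Master: Starting Spark master at spark://spark-master:7077\n"
    "15/10/27 21:25:07 INFO Master: I have been elected leader! New state: ALIVE\n"
)

print(master_status(sample_log))  # ('spark://spark-master:7077', True)
```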

After you know the master is running, you can use the [cluster proxy](../../docs/user-guide/accessing-the-cluster.md#using-kubectl-proxy) to connect to the Spark WebUI:

kubectl proxy --port=8001

The UI will then be available at http://localhost:8001/api/v1/proxy/namespaces/default/services/spark-webui/
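The proxy URL is assembled from the local proxy port, the namespace, and the service name. The sketch below shows that structure; the path layout is copied from the URL above and may differ on other Kubernetes versions, and the function name proxy_url is hypothetical.

```python
def proxy_url(service, namespace="default", port=8001):
    """Build the kubectl-proxy URL for a service.

    The path layout mirrors the URL given in this document; it is an
    assumption that your cluster version uses the same layout.
    """
    return "http://localhost:{}/api/v1/proxy/namespaces/{}/services/{}/".format(
        port, namespace, service
    )

# Reproduces the Spark WebUI address used in this example.
print(proxy_url("spark-webui"))
```

If you start the proxy on a different port or deploy into another namespace, pass those values instead of the defaults.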

Step Two: Start your Spark workers

The Spark workers do the heavy lifting in a Spark cluster. They provide execution resources and data cache capabilities for your program.

The Spark workers need the Master service to be running.

Use the examples/spark/spark-worker-controller.yaml file to create a replication controller that manages the worker pods.

$ kubectl create -f examples/spark/spark-worker-controller.yaml

Check to see if the workers are running

If you launched the Spark WebUI, your workers should just appear in the UI when they're ready. (It may take a little bit to pull the images and launch the pods.) You can also interrogate the status in the following way:

$ kubectl get pods
NAME                            READY     STATUS    RESTARTS   AGE
spark-master-controller-5u0q5   1/1       Running   0          25m
spark-worker-controller-e8otp   1/1       Running   0          6m
spark-worker-controller-fiivl   1/1       Running   0          6m
spark-worker-controller-ytc7o   1/1       Running   0          6m

$ kubectl logs spark-master-controller-5u0q5
[...]
15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM
15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM
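To summarize the registrations without reading the log by eye, here is a minimal sketch that extracts each worker's address, core count, and memory. It assumes the "Registering worker" message format shown above; the helper name registered_workers is hypothetical.

```python
import re

# Matches lines such as:
#   INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM
WORKER_RE = re.compile(
    r"Registering worker (\S+) with (\d+) cores, ([\d.]+) GB RAM"
)

def registered_workers(log_text):
    """Return a list of (address, cores, ram_gb) tuples found in the log."""
    return [
        (m.group(1), int(m.group(2)), float(m.group(3)))
        for m in WORKER_RE.finditer(log_text)
    ]

# The three registration lines from the sample output above.
sample_log = (
    "15/10/26 18:20:14 INFO Master: Registering worker 10.244.1.13:53567 with 2 cores, 6.3 GB RAM\n"
    "15/10/26 18:20:14 INFO Master: Registering worker 10.244.2.7:46195 with 2 cores, 6.3 GB RAM\n"
    "15/10/26 18:20:14 INFO Master: Registering worker 10.244.3.8:39926 with 2 cores, 6.3 GB RAM\n"
)

workers = registered_workers(sample_log)
total_cores = sum(cores for _, cores, _ in workers)
print(len(workers), "workers,", total_cores, "cores")  # 3 workers, 6 cores
```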

Assuming you still have the kubectl proxy running from the previous section, you should now see the workers in the UI as well. Note: The UI will have links to worker Web UIs. The worker UI links do not work (the links will attempt to connect to cluster IPs, which Kubernetes won't proxy automatically).

Step Three: Start your Spark driver to launch jobs on your Spark cluster

The Spark driver is used to launch jobs on the Spark cluster. You can read more about it in the Spark architecture documentation.

$ kubectl create -f examples/spark/spark-driver-controller.yaml
replicationcontrollers/spark-driver-controller

The Spark driver needs the Master service to be running.

Check to see if the driver is running

$ kubectl get pods -lcomponent=spark-driver
NAME                            READY     STATUS    RESTARTS   AGE
spark-driver-controller-vwb9c   1/1       Running   0          1m
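If you want to script the "is everything running?" check, the sketch below scans the table printed by kubectl get pods and lists any pod whose STATUS is not Running. It assumes the column headers shown above and that the output has already been captured as text; the helper name not_running is hypothetical.

```python
def not_running(pods_output):
    """Return the names of pods whose STATUS column is not 'Running'.

    Expects the whitespace-separated table printed by `kubectl get pods`,
    with NAME and STATUS columns as in the sample output above.
    """
    rows = pods_output.strip().splitlines()
    if not rows:
        return []
    header = rows[0].split()
    name_col = header.index("NAME")
    status_col = header.index("STATUS")
    pending = []
    for row in rows[1:]:
        fields = row.split()
        if fields[status_col] != "Running":
            pending.append(fields[name_col])
    return pending

# Sample output copied from this document.
sample = (
    "NAME                            READY     STATUS    RESTARTS   AGE\n"
    "spark-driver-controller-vwb9c   1/1       Running   0          1m\n"
)

print(not_running(sample))  # []
```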

Step Four: Do something with the cluster

Use kubectl exec to connect to the Spark driver and run a pipeline.

$ kubectl exec spark-driver-controller-vwb9c -it pyspark
Python 2.7.9 (default, Mar  1 2015, 12:57:24)
[GCC 4.9.2] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
      ____              __
     / __/__  ___ _____/ /__
    _\ \/ _ \/ _ `/ __/  '_/
   /__ / .__/\_,_/_/ /_/\_\   version 1.5.1
      /_/

Using Python version 2.7.9 (default, Mar  1 2015 12:57:24)
SparkContext available as sc, HiveContext available as sqlContext.
>>> sc.textFile("gs://dataflow-samples/shakespeare/*").map(lambda s: len(s.split())).sum()
939193

Congratulations, you just counted all of the words in all of the plays of Shakespeare.
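The pipeline maps each line of text to the number of whitespace-separated words it contains and then sums those counts across all lines. The same logic in plain Python, run on a tiny in-memory list instead of a distributed dataset (the two sample lines are made up for illustration), looks like this:

```python
# Stand-in for the lines that sc.textFile(...) would yield.
lines = [
    "To be, or not to be",
    "that is the question",
]

# Equivalent of .map(lambda s: len(s.split())): one word count per line.
words_per_line = [len(s.split()) for s in lines]

# Equivalent of .sum(): add the per-line counts together.
total_words = sum(words_per_line)

print(words_per_line)  # [6, 4]
print(total_words)     # 10
```

In Spark the map step runs in parallel on the workers, and only the partial sums travel back to the driver.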

Result

You now have services and replication controllers for the Spark master, Spark workers, and Spark driver. You can take this example to the next step and start using the Apache Spark cluster you just created. See the Spark documentation for more information.

tl;dr

kubectl create -f examples/spark

After it's set up:

kubectl get pods # Make sure everything is running
kubectl proxy --port=8001 # Start an application proxy, if you want to see the Spark WebUI
kubectl get pods -lcomponent=spark-driver # Get the driver pod to interact with.
