Following this example, you will create a functional Apache Spark cluster using Kubernetes and Docker.
You will set up a Spark master service and a set of Spark workers using Spark's standalone mode.
For the impatient expert, jump straight to the tl;dr section.
Source is freely available at:
- Docker image - https://github.com/mattf/docker-spark
- Docker Trusted Build - https://registry.hub.docker.com/search?q=mattf/spark
This example assumes you have a Kubernetes cluster installed and running, and that you have installed the kubectl command line tool somewhere in your path. Please see the getting started guides for installation instructions for your platform.
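For example, a quick check that kubectl is installed and can talk to your cluster:
$ # sanity check: the client/server versions print and the pod listing returns without error
$ kubectl version
$ kubectl get pods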
The Master service is the master (or head) service for a Spark cluster. Use the examples/spark/spark-master.json file to create a pod running the Master service.
$ kubectl create -f examples/spark/spark-master.json
Then, use the examples/spark/spark-master-service.json file to create a logical service endpoint that Spark workers can use to access the Master pod.
$ kubectl create -f examples/spark/spark-master-service.json
Ensure that the Master service is running and functional.
$ kubectl get pods,services
POD            IP              CONTAINER(S)   IMAGE(S)             HOST                        LABELS              STATUS
spark-master   192.168.90.14   spark-master   mattf/spark-master   172.18.145.8/172.18.145.8   name=spark-master   Running
NAME            LABELS                                    SELECTOR            IP               PORT
kubernetes      component=apiserver,provider=kubernetes   <none>              10.254.0.2       443
kubernetes-ro   component=apiserver,provider=kubernetes   <none>              10.254.0.1       80
spark-master    name=spark-master                         name=spark-master   10.254.125.166   7077
Connect to http://192.168.90.14:8080 to see the status of the master.
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077
* URL: spark://spark-master:7077
* Workers: 0
* Cores: 0 Total, 0 Used
* Memory: 0.0 B Total, 0.0 B Used
* Applications: 0 Running, 0 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE
...
(Pull requests welcome for an alternative that uses the service IP and port)
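This check still talks to the Master pod's IP and port directly, but you can at least look the IP up instead of copying it from the table above. A sketch, assuming your kubectl is new enough to support the -o jsonpath output flag:
$ # sketch: fetch the master pod's IP, then dump its status page as before
$ MASTER_IP=$(kubectl get pod spark-master -o jsonpath='{.status.podIP}')
$ links -dump "${MASTER_IP}:8080"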
The Spark workers do the heavy lifting in a Spark cluster. They provide execution resources and data cache capabilities for your program.
The Spark workers need the Master service to be running.
Use the examples/spark/spark-worker-controller.json file to create a ReplicationController that manages the worker pods.
$ kubectl create -f examples/spark/spark-worker-controller.json
$ links -dump 192.168.90.14:8080
[IMG] 1.2.1 Spark Master at spark://spark-master:7077
* URL: spark://spark-master:7077
* Workers: 3
* Cores: 12 Total, 0 Used
* Memory: 20.4 GB Total, 0.0 B Used
* Applications: 0 Running, 0 Completed
* Drivers: 0 Running, 0 Completed
* Status: ALIVE
Workers
Id                                          Address               State   Cores        Memory
worker-20150318151745-192.168.75.14-46422   192.168.75.14:46422   ALIVE   4 (0 Used)   6.8 GB (0.0 B Used)
worker-20150318151746-192.168.35.17-53654   192.168.35.17:53654   ALIVE   4 (0 Used)   6.8 GB (0.0 B Used)
worker-20150318151746-192.168.90.17-50490   192.168.90.17:50490   ALIVE   4 (0 Used)   6.8 GB (0.0 B Used)
...
(Pull requests welcome for an alternative that uses the service IP and port)
$ kubectl get pods,services
POD                             IP              CONTAINER(S)   IMAGE(S)             HOST                          LABELS                                STATUS
spark-master                    192.168.90.14   spark-master   mattf/spark-master   172.18.145.8/172.18.145.8     name=spark-master                     Running
spark-worker-controller-51wgg   192.168.75.14   spark-worker   mattf/spark-worker   172.18.145.9/172.18.145.9     name=spark-worker,uses=spark-master   Running
spark-worker-controller-5v48c   192.168.90.17   spark-worker   mattf/spark-worker   172.18.145.8/172.18.145.8     name=spark-worker,uses=spark-master   Running
spark-worker-controller-ehq23   192.168.35.17   spark-worker   mattf/spark-worker   172.18.145.12/172.18.145.12   name=spark-worker,uses=spark-master   Running
NAME            LABELS                                    SELECTOR            IP               PORT
kubernetes      component=apiserver,provider=kubernetes   <none>              10.254.0.2       443
kubernetes-ro   component=apiserver,provider=kubernetes   <none>              10.254.0.1       80
spark-master    name=spark-master                         name=spark-master   10.254.125.166   7077
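Because the workers are managed by a ReplicationController, changing how many there are is just a matter of resizing the controller. A sketch, assuming a kubectl recent enough to have the scale subcommand (older releases called this resize):
$ # sketch: run six workers instead of three; the new workers should appear on
$ # the master's status page as their pods come up
$ kubectl scale rc spark-worker-controller --replicas=6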
To use the cluster, start a driver in a container from the base image, map the spark-master hostname to the Master service IP, and point pyspark at the Master.
$ sudo docker run -it mattf/spark-base sh
sh-4.2# echo "10.254.125.166 spark-master" >> /etc/hosts
sh-4.2# export SPARK_LOCAL_HOSTNAME=$(hostname -i)
sh-4.2# MASTER=spark://spark-master:7077 pyspark
Python 2.7.5 (default, Jun 17 2014, 18:11:42)
[GCC 4.8.2 20140120 (Red Hat 4.8.2-16)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
Welcome to
____ __
/ __/__ ___ _____/ /__
_\ \/ _ \/ _ `/ __/ '_/
/__ / .__/\_,_/_/ /_/\_\ version 1.2.1
/_/
Using Python version 2.7.5 (default, Jun 17 2014 18:11:42)
SparkContext available as sc.
>>> import socket, resource
>>> sc.parallelize(range(1000)).map(lambda x: (socket.gethostname(), resource.getrlimit(resource.RLIMIT_NOFILE))).distinct().collect()
[('spark-worker-controller-ehq23', (1048576, 1048576)), ('spark-worker-controller-5v48c', (1048576, 1048576)), ('spark-worker-controller-51wgg', (1048576, 1048576))]
tl;dr:
kubectl create -f spark-master.json
kubectl create -f spark-master-service.json
Make sure the Master Pod is running (use: kubectl get pods).
kubectl create -f spark-worker-controller.json
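If you would rather run the above as one script, a sketch (it assumes you are in the directory containing these json files and simply greps the human-readable pod listing to wait for the Master):
# sketch: bring up the master, wait for it, then start the workers
kubectl create -f spark-master.json
kubectl create -f spark-master-service.json
until kubectl get pods | grep spark-master | grep -q Running; do
  sleep 5
done
kubectl create -f spark-worker-controller.json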