This project deploys a standalone Spark cluster onto a Docker Swarm and includes the Quantcast File System (QFS) as the cluster's distributed file system. Why QFS? Why not. This configuration also launches a Jupyter PySpark notebook that is connected to the Spark cluster. The cluster has `matplotlib` and `pandas` preinstalled for your PySpark-on-Jupyter joys.
First, edit the following items as needed for your swarm:
- `worker-node` -> `spark-conf` -> `spark-env.sh`: adjust the environment variables as appropriate for your cluster's nodes, most notably `SPARK_WORKER_MEMORY` and `SPARK_WORKER_CORES`. Leave 1-2 cores and at least 10% of RAM for other processes.
- `worker-node` -> `spark-conf` -> `spark-defaults.conf`: adjust the memory and core settings for the executors and the driver. Each executor should have about 5 cores (if possible), and `spark.executor.cores` should divide evenly into `SPARK_WORKER_CORES`. Spark will launch as many executors per worker as `SPARK_WORKER_CORES` divided by `spark.executor.cores`. Reserve about 7-8% of `SPARK_WORKER_MEMORY` for overhead when setting `spark.executor.memory`. A worked sizing example covering both of these files follows this list.
- `build-images.sh`: adjust the IP address for your local Docker registry to one that all nodes in your cluster can access. You can use a domain name if all nodes in your swarm can resolve it. This is needed so that every node in the swarm can pull the locally built Docker images. (See the registry note after this list.)
- `deploy-spark-qfs-swarm.yml`: adjust all image names to use the local Docker registry address you set in the prior step. Also adjust the resource limits for each of the services. Setting a `cpus` limit here that is smaller than the number of cores on your node has the effect of giving your process a fraction of each core's capacity. You might consider doing this if your swarm hosts other services or does not handle sustained 100% CPU load well (e.g., it overheats).
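For a concrete (purely illustrative) sizing example: on worker nodes with 12 cores and 64 GB of RAM, you might reserve 2 cores and roughly 10% of the RAM for the OS and other processes, which leads to something like the following in `spark-env.sh`:

```bash
# worker-node/spark-conf/spark-env.sh -- illustrative values for a 12-core / 64 GB node
SPARK_WORKER_CORES=10     # 12 cores minus 2 reserved for other processes
SPARK_WORKER_MEMORY=56g   # leaves roughly 12% of the 64 GB for everything else
```

In `spark-defaults.conf` you would then set `spark.executor.cores` to `5` (a whole divisor of 10, so each worker launches two executors) and `spark.executor.memory` to about `25g` (56 GB split across two executors, minus roughly 7-8% for overhead).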
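If you do not already have a registry that every node in the swarm can reach, one common approach (not something this project sets up for you) is to run Docker's standard registry image as a service on the swarm:

```bash
# run a basic local registry as a swarm service, published on port 5000
docker service create --name registry --publish published=5000,target=5000 registry:2
```

If the registry is not served over TLS, you may also need to list it under `insecure-registries` in each node's Docker daemon configuration.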
This setup depends on having a GlusterFS volume mounted at `/mnt/gfs` and a normal file system (such as XFS) mounted at `/mnt/data` on all nodes, with the following directories existing on them (a command to create them is shown after this list):

- `/mnt/gfs/jupyter-notbooks` - used to persist the Jupyter notebooks.
- `/mnt/gfs/data` - a location for transitionally storing data that is accessible from the Jupyter server.
- `/mnt/data/qfs/logs` - where QFS will store its logs.
- `/mnt/data/qfs/chunk` - where the QFS chunk servers will store the data.
- `/mnt/data/qfs/checkpoint` - where the QFS metaserver will store the file system checkpoints. This only needs to exist on the master node.
- `/mnt/data/spark` - the local working directory for Spark.
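Assuming the mount points above, the directories can be created with something like the following. The QFS checkpoint directory is strictly only needed on the master node, and everything should be writable by the user the containers run as (the `docker run` examples below use a `spark` user):

```bash
# run on each node (the /mnt/gfs paths live on the shared GlusterFS volume,
# so they only need to be created once)
mkdir -p /mnt/gfs/jupyter-notbooks /mnt/gfs/data
mkdir -p /mnt/data/qfs/logs /mnt/data/qfs/chunk /mnt/data/qfs/checkpoint /mnt/data/spark
```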
You can adjust these as you see fit, but be sure to update the mounts specified in `deploy-spark-qfs-swarm.yml`. Then build the Docker images from within this project's directory:
./build-images.sh
Before the first time you run this cluster, you will need to initialize the QFS file system. Do so by launching a qfs-master container on the master node:
docker run -it -u spark --mount type=bind,source=/mnt/data/qfs,target=/data/qfs master:5000/qfs-master:latest /bin/bash
Then, at the shell prompt in this container, run the following to initialize QFS and create the directory for the Spark history server:
$QFS_HOME/bin/metaserver -c $QFS_HOME/conf/Metaserver.prp
qfs -mkdir /history/spark-event
exit
Finally, to start up the Spark cluster in your Docker swarm:
docker stack deploy -c deploy-spark-qfs-swarm.yml spark
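Once the stack is deployed, you can verify that all of the services started and reached their desired replica counts with:

```bash
docker stack services spark   # shows each service in the stack and its replica status
```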
Point your development computer's browser at http://swarm-public-ip:7777/ to load the Jupyter notebook.
To launch a Docker container to give you command line access to QFS, use the following command:
docker run -it --network="spark_cluster_network" -u spark master:5000/qfs-master:latest /bin/bash
Note that you must attach to the network that the Spark cluster's services are using. From this command prompt, the following commands are pre-configured to connect to the QFS instance (a short usage example follows this list):
- `qfs` - enables most Linux-style file operations on the QFS instance.
- `cptoqfs` - copies files from the local file system (in the Docker container) to the QFS instance.
- `cpfromqfs` - copies files from the QFS instance to the local file system (in the Docker container).
- `qfsshell` - a useful shell-style interface to the QFS instance.
- `qfsfsck` - performs `fsck` on the QFS file system.
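As a quick illustration (the paths here are hypothetical, and the exact flags accepted by `cptoqfs` are best confirmed from its usage output inside the container), you could stage a local file into QFS and confirm it arrived with something like:

```bash
# inside the qfs-master container started above
qfs -mkdir /user/spark/example
cptoqfs -d /tmp/some-local-file.csv -k /user/spark/example/some-local-file.csv
qfs -ls /user/spark/example
```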
You might consider adding a volume mount to the `docker run` command so that the Docker container can access data from your local file system.
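For example, to make the shared GlusterFS data directory from earlier available inside the container (the target path inside the container is just an illustration):

```bash
docker run -it --network="spark_cluster_network" -u spark \
    --mount type=bind,source=/mnt/gfs/data,target=/data/local \
    master:5000/qfs-master:latest /bin/bash
```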