This repository stores all the required components to build a containerized cluster for Big Data and Data Science applications. It allows scalable production services using technologies such as Machine Learning Python libraries, Apache Spark analytics engine, Scala language, HDFS and Docker containers among others.
The code has been tested using:
- Apache Spark (3.5): an unified analytics engine for Big Data processing, with built-in modules for streaming, SQL, Machine Learning and graph processing. It has high-level APIs in Scala and Python.
- Hadoop (3.4): an open-source software for reliable, scalable, distributed computing. It uses Hadoop Distributed File System (HDFS) which is suitable to work with large RDD (Resilient Distributed Datasets).
- Docker (27.4): an open platform for developers and sysadmins to build, ship, and run distributed applications, whether on laptops, data center VMs, or the cloud.
- Docker Compose (2.32): a tool for defining and running multi-container Docker applications.
The virtual environment employed for Data Science applications is generated from requirements.txt file located in the repository.
The main components of this virtual environment are listed below:
- Python (3.12): an interpreted high-level programming language for general-purpose programming.
- Jupyter Lab (4.3): a web-based interactive development environment for Jupyter Notebooks, code, and data.
- Keras (TensorFlow built-in): a high-level neural networks API, written in Python and capable of running on top of TensorFlow, CNTK, or Theano.
- TensorFlow (2.18): an open source Deep Learning library for high performance numerical computation using data flow graphs.
- Matplotlib (3.10): a plotting library for Python and its numerical mathematics extension NumPy.
- NumPy (2.0): a library for Python, adding support for large, multi-dimensional arrays and matrices, along with a large collection of high-level mathematical functions to operate on these arrays.
- Pandas (2.2): an open source library providing high-performance, easy-to-use data structures and data analysis tools for Python.
- scikit-learn (1.6): a machine learning library for Python. It features various classification, regression and clustering algorithms including support vector machines, random forest, gradient boosting, k-means and DBSCAN.
- scikit-image (0.25): a collection of algorithms for image processing with Python.
- TPOT (0.12): a Python Automated Machine Learning tool that optimizes machine learning pipelines using genetic programming.
- XGBoost (2.1): an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable.
- Folium (0.19): an open source library to visualize data that has been manipulated in Python on an interactive Leaflet.js map.
- ipyleaflet (0.19): a Jupyter / Leaflet.js bridge enabling interactive maps in Jupyter Notebook.
- Seaborn (0.13): a Python visualization library based on Matplotlib. It provides a high-level interface for drawing attractive statistical graphics.
- imbalanced-learn (0.13): a Python package offering a number of re-sampling techniques commonly used in datasets showing strong between-class imbalance. It is compatible with scikit-learn and it allows SMOTE (Synthetic Minority Over-sampling Technique).
- joblib (1.4): a set of tools to provide lightweight pipelining in Python.
- findspark (2.0): a package to make Spark Context available in Jupyter Notebook.
It is available in the Spark master node created with Docker Compose.
Command to access Spark master node:
~/bigdata_docker/$ docker compose exec master bash
~/usr/spark-3.5.4/$
The bigdata_docker main folder contains subfolders, application and data files needed to build Big Data and Data Science solutions:
bigdata_docker
├── .gitignore
├── conf
│ ├── master
│ └── worker
├── data
│ ├── test_log1.csv
│ └── test_log2.csv
├── master
│ ├── Dockerfile
│ └── work_dir
│ ├── notebooks
│ ├── pyproject.toml
│ ├── python_apps
│ │ └── example
│ ├── requirements.txt
│ └── scala_apps
│ └── example
├── docker-compose.yml
└── README.md
- conf: stores Spark configuration files for master and worker nodes. These folders are mapped as volumes in the Docker Compose file and they can be accessed from containers through conf/ path.
- data: folder to contain raw, processed and test data. It is mapped as volume in [docker-compose] and it can be accessed from containers through tmp/data/ path.
- docker-compose.yml: creates the Spark cluster based on Docker in which the applications shall run.
- master: stores all configuration and working files for the Spark master and worker nodes of the cluster created with Docker Compose.
- Dockerfile: defines all required tools, virtual environment and work files to be installed in the Spark master and worker nodes.
- work_dir: stores files employed for Big Data and Data Science applications.
The work_dir folder has the following structure:
work_dir
├── pyproject.toml
├── requirements.txt
├── notebooks
│ └── Example.ipynb
├── python_apps
│ └── example
└── scala_apps
└── example
- requirements.txt: file which defines the dependencies for the virtual environment employed by Python Data Science applications and Jupyter Notebooks.
- notebooks: Jupyter Notebooks for data analysis, elaboration and training of prediction models and testing.
- scala_apps: used to contain Spark applications written in Scala. There is one example application compiled using Maven.
- python_apps: folder to store Python applications. There is one example application.
The system has three main components:
- Containerized Big Data cluster: It shall be the base of the system and it can allow to run large files processing and predictive applications.
- Scala Big Data applications: It shall process the available large data files and extract the relevant information that it will be used to train and feed the predictive models.
- Python Data Science applications: It shall employ Python Data Science libraries to use machine learning models for tasks such as predicitions.
Apart from the three main components listed above Jupyter Notebooks are also utilized for data analysis, modelling and testing of applications.
The system has to be a scalable solution. Thus the applications shall be deployed in a Big Data cluster built on Apache Spark, Hadoop and Docker containers.
The reason for this choice is because Docker enables the utilization of container clustering systems to set up and scale the processing and predictive applications in production. It makes easy to add new containers to handle additional load.
The containers shall run Spark as data engine and HDFS for storage in master and worker nodes. The Spark master node has also Maven and the Python virtual environment installed.
The number of worker nodes can be increased modifying the docker-compose file. By default it creates one master and one worker node.
The following diagram illustrates the Big Data cluster architecture in blocks:
flowchart LR;
Client<-->id1[["Master <br>--------<br>Python <br>Spark <br>HDFS"]];
subgraph Big Data Cluster;
id1[["Master <br>--------<br>Python <br>Spark <br>HDFS"]]<-->id2[["Worker <br>--------<br>Spark <br>HDFS"]];
end;
Other possible improvements in the Big Data cluster that shall not be implemented here could be:
- Use of Kubernetes to manage the Docker containers.
- Take advantage of Cloud Computing services, such as AWS EMR, to build up a Spark cluster with the desired amount of resources and only utilize them when is required for cost efficiency.
The steps and commands to run the Spark cluster with Docker Compose are described below.
Before executing Docker Compose is strongly recommended to close other applications to free up resources and ports to avoid potential issues. Then Docker Compose can be execute to build services:
~/bigdata_docker/$ docker compose build
Next step consists in executing Docker Compose up command:
~/bigdata_docker/$ docker compose up
It is likely that for the first time it could spend some time to download Docker images and additional packages. If everything goes fine at the end the cluster should be ready appearing something similar to:
...
master_1 | 2018-10-19 09:59:53 INFO Master:54 - I have been elected leader! New state: ALIVE
master_1 | 2018-10-19 09:59:53 INFO Master:54 - Registering worker 172.27.0.3:8881 with 2 cores, 2.0 GB RAM
worker_1 | 2018-10-19 09:59:53 INFO Worker:54 - Successfully registered with master spark://master:7077
To shutdown the cluster simply press 'Control+C' and wait patiently to return to shell.
It is necessary to filter and prepare the data from RDDs to extract the relevant information that will be used by Python Data Science applications. The approach to accomplish this task can be the employ of Spark applications programmed in Scala.
A Scala Big Data example application is stored in work_dir/scala_apps/example/ folder and for the first time it must be compiled with Maven to generate the .jar target file. This is done automatically with the Dockerfile but it can be done manually using the following command:
~/usr/spark-3.5.4/work_dir/scala_apps/example$ mvn package
The application requires the parameters min-range-Id, max-range-Id, path-input-log1, path-input-log2, path-output-log.
Command to run the Example application locally in the Spark master node with test logs:
~/usr/spark-3.5.4/work_dir/scala_apps/example$ spark-submit \
--master local[2] \
--class stubs.Example \
target/example-1.0.jar \
1 49999 \
/tmp/data/test_log1.csv \
/tmp/data/test_log2.csv \
/tmp/data/result_local_log
Command to run the Example application in the Spark worker node with test logs:
~/usr/spark-3.5.4/work_dir/scala_apps/example$ spark-submit \
--master spark://master:7077 \
--class stubs.Example \
target/example-1.0.jar \
1 49999 \
/tmp/data/test_log1.csv \
/tmp/data/test_log2.csv \
/tmp/data/result_worker_log
When using larger files it is recommended to tune additional parameters to provide additional resources. e.g. "--driver-memory 10g".
The way to run the Python example application is simple. Just go to work_dir/python_apps/example/ folder and execute it:
Command to access Spark master node:
~/bigdata_docker/$ docker compose exec master bash
~/usr/spark-3.5.4/$
Command to run Python example application in master node:
~/usr/spark-3.5.4/$ cd work_dir/python_apps/example
~/usr/spark-3.5.4/work_dir/python_apps/example$ python3 main.py 10000
A good way to analyze data, build machine learning models and test them is through Jupyter Lab. An example of Jupyter Notebook is stored in the work_dir/notebooks/ folder.
All the required packages to run Jupyter Notebooks remotely in the Spark master node are installed so it is possible to run them through web interface. To achieve this it is necessary to use the commands shown below:
Command to access master node:
~/bigdata_docker/$ docker compose exec master bash
~/usr/spark-3.5.4$
Launch Jupyter Lab service in master node.
~/usr/spark-3.5.4$ jupyter lab \
--notebook-dir=/usr/spark-3.5.4/work_dir/notebooks \
--ip='0.0.0.0' \
--port=8888 \
--no-browser \
--allow-root
Now Jupyter Notebooks stored in the master node can be run remotely. Next step is to open a local web browser and paste the URL printed after executing the launch command to access to the Jupyter Lab interface, checking that the server is running fine. A similar output will be shown:
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://(master or 127.0.0.1):8888/?token=<token>
Valid URL:
http://localhost:8888/?token=<token>
To shutdown the Jupyter Lab service in the master node simply press 'Control+C' and then confirm with 'y'.
author: alvertogit copyright: 2018-2025