The system will process financial data as a real-time stream. It will consist of several components: Apache Kafka, Apache Spark, a NoSQL database, and a data-visualization tool (Power BI or Grafana). The data source will be the Finnhub.io WebSocket API.
All components, except for Finnhub.io, have been containerized to simplify project development and improve its reliability. Each container is described below, from left to right in the diagram:
- **Finnhub.io**: The data source, a free API providing real-time financial data.
- **Data-Producer container**: Connects to the Finnhub.io API via WebSocket, serializes the incoming data, and sends it to the Apache Kafka broker.
- **ZooKeeper container**: A centralized service that maintains configuration information and naming, and provides distributed synchronization and group services for Kafka.
- **Kafdrop container**: A web UI used to monitor the Apache Kafka broker.
- **Apache Kafka container**: A message broker that receives data from the Data-Producer and stores it until it is retrieved by Apache Spark.
- **Apache Spark cluster**: Consists of four containers:
  - **Main-Processor**: The container that executes the jobs: it reads data from Apache Kafka, converts it into Spark DataFrames, performs the required aggregations with PySpark, and saves the results to the Apache Cassandra container.
  - **Spark-Master**: Manages **Spark-Worker-1** and **Spark-Worker-2**, which perform the actual distributed data processing.
- **CassandraDB container**: A single-node (one-instance) Apache Cassandra NoSQL cluster that stores the processed data from the Apache Spark cluster.
- **Grafana container**: Fetches data from CassandraDB and displays it on the charts shown below.
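The containerized topology above can be sketched as a Docker Compose file. This is an illustrative sketch only: the image names, build paths, service names, and ports are assumptions, not the project's actual configuration.

```yaml
# Illustrative service topology; images, ports, and build paths are assumptions.
version: "3.8"
services:
  zookeeper:
    image: bitnami/zookeeper:latest
    ports: ["2181:2181"]
  kafka:
    image: bitnami/kafka:latest
    depends_on: [zookeeper]
    ports: ["9092:9092"]
  kafdrop:
    image: obsidiandynamics/kafdrop:latest
    depends_on: [kafka]
    ports: ["9000:9000"]
  data-producer:
    build: ./data-producer        # connects to Finnhub.io, publishes to Kafka
    depends_on: [kafka]
  spark-master:
    image: bitnami/spark:latest
  spark-worker-1:
    image: bitnami/spark:latest
    depends_on: [spark-master]
  spark-worker-2:
    image: bitnami/spark:latest
    depends_on: [spark-master]
  main-processor:
    build: ./main-processor       # PySpark job: Kafka -> aggregations -> Cassandra
    depends_on: [kafka, spark-master, cassandra]
  cassandra:
    image: cassandra:latest
    ports: ["9042:9042"]
  grafana:
    image: grafana/grafana:latest
    depends_on: [cassandra]
    ports: ["3000:3000"]
```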
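The Data-Producer's role (connect via WebSocket, serialize, publish to Kafka) can be sketched as follows. This is a minimal sketch, not the project's actual code: it assumes the third-party `websocket-client` and `kafka-python` libraries, a placeholder API token, and a `trades` topic name; the serialization step itself uses only the standard library and follows the field names Finnhub uses in its trade messages (`s`, `p`, `v`, `t`).

```python
import json
import time


def serialize_trade(raw_message: str) -> list[bytes]:
    """Convert one Finnhub WebSocket message (a JSON string whose 'data'
    field is a list of trades) into UTF-8 JSON payloads for Kafka."""
    message = json.loads(raw_message)
    payloads = []
    for trade in message.get("data", []):
        record = {
            "symbol": trade["s"],    # ticker symbol
            "price": trade["p"],     # last trade price
            "volume": trade["v"],    # trade volume
            "trade_ts": trade["t"],  # trade time, ms since epoch
            "ingest_ts": int(time.time() * 1000),
        }
        payloads.append(json.dumps(record).encode("utf-8"))
    return payloads


def main() -> None:
    # Third-party dependencies are imported lazily so the serialization
    # logic above stays stdlib-only.
    import websocket                 # pip install websocket-client
    from kafka import KafkaProducer  # pip install kafka-python

    producer = KafkaProducer(bootstrap_servers="kafka:9092")

    def on_message(ws, raw_message: str) -> None:
        for payload in serialize_trade(raw_message):
            producer.send("trades", payload)  # topic name is an assumption

    ws = websocket.WebSocketApp(
        "wss://ws.finnhub.io?token=YOUR_API_TOKEN",  # placeholder token
        on_message=on_message,
        on_open=lambda ws: ws.send(
            json.dumps({"type": "subscribe", "symbol": "AAPL"})
        ),
    )
    ws.run_forever()


if __name__ == "__main__":
    main()
```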
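The Main-Processor performs its aggregations with PySpark on streaming DataFrames. To illustrate the kind of aggregation involved without a Spark dependency, here is the same idea in plain Python: a tumbling-window, per-symbol volume-weighted average price over trade records shaped like the Data-Producer's output. The window length and field names are assumptions, and this stands in for (not reproduces) the actual PySpark job.

```python
from collections import defaultdict


def tumbling_window_avg(trades, window_ms: int = 5_000):
    """Group trades into fixed (tumbling) time windows per symbol and
    compute the volume-weighted average price of each bucket.

    `trades` is an iterable of dicts with 'symbol', 'price', 'volume',
    and 'trade_ts' (ms since epoch).
    """
    sums = defaultdict(lambda: [0.0, 0.0])  # (symbol, window_start) -> [p*v, v]
    for t in trades:
        window_start = t["trade_ts"] - t["trade_ts"] % window_ms
        bucket = sums[(t["symbol"], window_start)]
        bucket[0] += t["price"] * t["volume"]
        bucket[1] += t["volume"]
    return {key: pv / vol for key, (pv, vol) in sums.items() if vol > 0}


trades = [
    {"symbol": "AAPL", "price": 100.0, "volume": 2, "trade_ts": 0},
    {"symbol": "AAPL", "price": 110.0, "volume": 2, "trade_ts": 1_000},
    {"symbol": "AAPL", "price": 120.0, "volume": 1, "trade_ts": 6_000},
]
result = tumbling_window_avg(trades)
# -> {("AAPL", 0): 105.0, ("AAPL", 5000): 120.0}
```

In the real job, the equivalent is a `groupBy` over Spark's `window()` function on the trade timestamp column, with the result streamed to Cassandra.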