Using Java/Scala to work with Kafka is hard (especially for someone like me). Ta-da!
Kafka is a distributed messaging system. It is at the core of many production systems in places such as Uber and LinkedIn (who created Kafka).
The architecture is a publish-subscribe model, where consumers read messages from topics that they have subscribed, where the messages are sent by producers.
Common use-cases:
- messaging between applications, where you can have applications "talk" to each using messages
- data processing pipelines from source systems to target destinations, thereby processing information on a streaming basis, rather than in batches as with your traditional ETL jobs
From Confluent.io
There is a lot more complexity under the hood, and I suggest you read the official docs for more information.
Download the 2.0.0 release and un-tar it.
tar -xzf kafka_2.11-2.0.0.tgz
Also, install pip requirements by running pip install -r requirements.txt
Kafka uses ZooKeeper so you need to first start a ZooKeeper server.
cd kafka_2.11-2.0.0
bin/zookeeper-server-start.sh config/zookeeper.properties
Now start the Kafka server:
bin/kafka-server-start.sh config/server.properties
Create a topic named "test":
bin/kafka-topics.sh --create --zookeeper localhost:2181 --replication-factor 1 --partitions 1 --topic test
See list of topics using the following command:
bin/kafka-topics.sh --list --zookeeper localhost:2181 test
In separate command shells, run the following:
python consumer.py
This is a consumer of the messages sent through Kafka. Simple writing to CSV of the streams is implemented.
Note: Press CTRL + C to send KeyboardInterrupt to exit the process. Alternatively, close the shell session.
python producer.py
Producer of messages. Key in any valid string to send. Type "quit" to exit.
You should now see the shell running consumer.py
displaying the messages from Kafka!
bin/kafka-server-stop.sh
bin/zookeeper-server-stop.sh
This would terminate both server processes.