A thick, write-only client for writing across several ClickHouse MergeTree tables located on different shards.
It is a good alternative to writing via the ClickHouse Distributed engine, which has proven problematic in practice for several reasons (see the talk referenced at the end of this section).
The core functionality is the writer. It works on top of Apache Spark and takes a DataFrame as input.
The writer can also write data without duplicates from a repeatable source. For example, this is very useful for achieving exactly-once semantics (EOS) when a Kafka-to-ClickHouse sink is needed, and it is a good alternative to the ClickHouse Kafka engine.
To make this work, the writer relies on the following:
- ClickHouse deduplication of identical data blocks. See: https://clickhouse.tech/docs/en/operations/settings/settings/#settings-insert-deduplicate.
- A sorting key used to order rows inside a partition, so that a retried insert produces identical blocks and ClickHouse can recognize them as the same data. It can be any set of columns that yields a stable ordering within a partition, as shown in the sketch after this list.
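As a minimal sketch of the second point, stable blocks can be produced with plain Spark operations. The column names shard_key, event_time, and id are illustrative assumptions, and this is only an illustration of the idea, not the library's internal code:

import org.apache.spark.sql.DataFrame

// Repartition by a (hypothetical) shard key and sort within each partition,
// so that re-running the same batch yields identical blocks, which is what
// lets ClickHouse deduplicate retried inserts.
def toStableBlocks(df: DataFrame): DataFrame =
  df.repartition(df.col("shard_key"))           // same rows land in the same partition on every run
    .sortWithinPartitions("event_time", "id")   // stable in-partition order => identical blocks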
Here is pseudo-code showing how it could be used to consume data from Kafka and insert into ClickHouse with Spark Structured Streaming (ver. 2.4+):
val streamDF = spark.readStream
  .format("kafka")
  .option(<kafka brokers and topics>)
  .load()

val writer = new SparkToClickHouseWriter(<my_conf>)

// foreachBatch is available on the stream's writer and receives the
// micro-batch DataFrame together with its batch id
streamDF.writeStream
  .foreachBatch { (df, _) => writer.write(df) }
  .start()
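The exact contents of <my_conf> depend on the library's API. As a purely hypothetical illustration, a configuration would typically carry the shard endpoints, the target table, and the sorting key described above; the class and field names below are assumptions, not the library's actual API:

// Hypothetical configuration sketch; names are illustrative only.
case class WriterConfig(
  shardHosts: Seq[String],   // one ClickHouse endpoint per shard
  table: String,             // target MergeTree table on each shard
  sortingKey: Seq[String]    // columns giving a stable in-partition order
)

val myConf = WriterConfig(
  shardHosts = Seq("ch-shard-1:9000", "ch-shard-2:9000"),
  table = "events_local",
  sortingKey = Seq("event_time", "id")
)

Note that foreachBatch may re-execute a batch after a failure; the block deduplication described above is what makes such retries safe.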
There is a talk about the problems with the ClickHouse Distributed and Kafka engines and the reasons that forced us to implement this utility library.