
[FEATURE] Support more chunking strategies #1081

Open
orpiske opened this issue May 10, 2024 · 11 comments
Labels
enhancement New feature or request P2 High priority RAG

Comments

@orpiske

orpiske commented May 10, 2024

Is your feature request related to a problem? Please describe.

One of the lessons we learned from a project we worked on recently was that there doesn't seem to be great or widespread support for chunking in Java. We were particularly looking for support for different chunking strategies, which could have helped us maximize our ability to store, retrieve, and match data in our vector DB.

Describe the solution you'd like

We would like to discuss with the Langchain4j community whether adding support for chunking is feasible within this project and aligned with its goals and feature set.

Describe alternatives you've considered

Among other things, we have considered creating a chunking library as a separate project, but we believe that adding a chunking library as part of Langchain4j would result in a better developer experience and would also make it easier for the project to implement strategies that involve LLM-based chunking.

Additional context

If the community believes that this is in line with the project, we are motivated to contribute and help maintain this feature.

@orpiske orpiske added the enhancement New feature or request label May 10, 2024
@langchain4j
Owner

Hi @orpiske, this sounds great! We are looking forward to improving this in LC4J, so any contributions are welcome!
Apart from the existing document splitters, we plan to add a Markdown splitter and a semantic splitter in the near future.

What chunking strategies do you have in mind?

@orpiske
Author

orpiske commented May 10, 2024

It's great news that you have plans for a Markdown (and/or AsciiDoc) splitter and a semantic splitter! Those would have been very useful for our project.

In general, specialized chunkers/splitters (and/or an interface for implementing them) would be particularly helpful for a subset of our data (e.g. so we could deal with YAML, XML, etc.).

I also think that API/service/LLM-based chunking, where we defer the chunking to an external service, could be useful.

@langchain4j
Owner

The interface you are looking for is DocumentSplitter; please take a look at it and its existing implementations.
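
For illustration, a minimal custom implementation could look roughly like this (just a sketch: it assumes the split(Document) method of DocumentSplitter, and the YAML handling is deliberately naive):

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;

import java.util.ArrayList;
import java.util.List;

// Hypothetical YAML-aware splitter: cuts on top-level "---" document separators.
public class YamlDocumentSplitter implements DocumentSplitter {

    @Override
    public List<TextSegment> split(Document document) {
        List<TextSegment> segments = new ArrayList<>();
        // Naive split on YAML document separators; a real implementation would use a YAML parser
        // and would also propagate the document metadata onto each segment.
        for (String part : document.text().split("(?m)^---\\s*$")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                segments.add(TextSegment.from(trimmed));
            }
        }
        return segments;
    }
}
```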

@orpiske
Author

orpiske commented May 10, 2024

The interface you are looking for is DocumentSplitter; please take a look at it and its existing implementations.

Noted, thanks!

@langchain4j langchain4j added the P3 Medium priority label May 13, 2024
@langchain4j langchain4j changed the title [FEATURE] Support for chunking strategies [FEATURE] Support more chunking strategies Jul 4, 2024
@langchain4j langchain4j added P2 High priority RAG and removed P3 Medium priority labels Oct 22, 2024
@fedecompa

For example, this one would be very useful:
https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/

@glaforge
Collaborator

@orpiske @fedecompa I was thinking of contributing the ones I presented in my recent Advanced RAG talk:

  • Sliding window parent/child chunking (you embed a sentence but store/return the surrounding sentences; see the sketch below)
  • Hypothetical questions
  • Contextual retrieval (from Anthropic)
  • Semantic chunking (from Greg Kamradt) with a sliding window of sentences
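
For the first one, the idea is roughly this (a framework-agnostic sketch; the sentence splitting is deliberately naive and the class/record names are just placeholders):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sliding-window parent/child chunking: the "child" is the single sentence you embed,
// the "parent" is the window of surrounding sentences you store and return at retrieval time.
public class SlidingWindowChunker {

    public record Chunk(String textToEmbed, String textToReturn) {}

    public List<Chunk> chunk(String text, int windowSize) {
        // Naive sentence split; a real implementation would use java.text.BreakIterator or similar.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(sentences.length, i + windowSize + 1);
            String window = String.join(" ", Arrays.copyOfRange(sentences, from, to));
            chunks.add(new Chunk(sentences[i], window));
        }
        return chunks;
    }
}
```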

@fedecompa

@glaforge I am implementing a SemanticChunker class myself. Based on your experience, what is the optimal semantic similarity threshold? I have tried using a default value of 0.85.

@glaforge
Collaborator

The approach I was taking was more to find the lowest similarities (by ordering them) and decide how many breakpoints I want: https://github.com/datastaxdevs/conference-2024-devoxx/blob/main/devoxx-rag-naive-to-advanced/src/test/java/devoxx/rag/_3_advanced_rag_ingestion/_39_semantic_chunking.java

There's another thing to keep in mind: you could also end up with a big chunk that is longer than the size the embedding model can accept. So maybe some refinement is needed to further split such a big chunk, looking for the lowest similarities within that chunk alone.

One more thought... sometimes you may have a sentence somewhere whose content is really different from the rest of the paragraph it's in, for example a joke, or a remark pointing at some related topic, etc. To avoid having breakpoints there, I'm measuring similarity between sliding windows of sentences rather than at a single sentence boundary.

One more refinement: should we have some overlap, to ensure we're not missing useful context surrounding the breakpoint?
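
To illustrate, the breakpoint selection itself boils down to something like this (a simplified sketch, not the exact code from the repo linked above; sentence embeddings are assumed to be precomputed with whatever embedding model you use):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Pick the k boundaries with the lowest similarity between neighbouring sentences as breakpoints.
// To avoid breaking on a single odd sentence (a joke, an aside), you would average each side over
// a small sliding window of sentences instead of comparing single sentences.
public class SemanticBreakpoints {

    record Boundary(int index, double similarity) {}

    public static List<Integer> lowestSimilarityBreakpoints(List<float[]> sentenceEmbeddings, int breakpointCount) {
        List<Boundary> boundaries = new ArrayList<>();
        for (int i = 0; i < sentenceEmbeddings.size() - 1; i++) {
            boundaries.add(new Boundary(i + 1, cosine(sentenceEmbeddings.get(i), sentenceEmbeddings.get(i + 1))));
        }
        return boundaries.stream()
                .sorted(Comparator.comparingDouble(Boundary::similarity)) // lowest similarity first
                .limit(breakpointCount)
                .map(Boundary::index)
                .sorted()
                .toList();
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```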

To be honest, my impression is that semantic chunking doesn't really give such good splits.
Also, have a look at this paper, which seems to indicate it's not necessarily doing that great:
https://arxiv.org/abs/2410.13070

@fedecompa

fedecompa commented Dec 18, 2024

@glaforge I have reached the same conclusions. I also had to deal with the fact that you can end up with a big chunk that is longer than the size supported by the embedding model. Furthermore, I can confirm the conclusions of the paper: this chunking strategy has high computational costs.

@glaforge
Collaborator

@fedecompa in my experience, one of the best chunking strategies I've tested was to use sliding windows of sentences: you split by sentences, you embed each sentence, but you return the surrounding sentences with it. It worked great for me.

But back to semantic chunking: maybe we can tweak it with some of the things we've already discussed, like max segment size and sliding windows. And instead of looking purely at a threshold, using a dichotomy search would help accommodate the max segment size (i.e. split in two at the lowest-similarity breakpoint, then look at both sides and, if needed, split again at the next lowest-similarity breakpoint, etc.).
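
Roughly like this (a sketch of the idea only; adjacentSimilarities[i] would be the precomputed similarity between sentence i and i+1, and segment size is approximated in characters rather than tokens):

```java
import java.util.ArrayList;
import java.util.List;

// Dichotomy-style semantic splitting: if a range of sentences is too big, cut it at the
// lowest-similarity boundary inside the range and recurse on both halves.
public class RecursiveSemanticSplitter {

    public static List<List<String>> split(List<String> sentences, double[] adjacentSimilarities, int maxChars) {
        List<List<String>> out = new ArrayList<>();
        splitRecursively(sentences, adjacentSimilarities, 0, sentences.size(), maxChars, out);
        return out;
    }

    private static void splitRecursively(List<String> sentences, double[] similarities,
                                         int from, int to, int maxChars, List<List<String>> out) {
        int totalChars = 0;
        for (int i = from; i < to; i++) {
            totalChars += sentences.get(i).length();
        }
        if (totalChars <= maxChars || to - from < 2) {
            out.add(new ArrayList<>(sentences.subList(from, to)));
            return;
        }
        // Cut at the boundary with the lowest similarity inside this range, then refine each half.
        int cut = from + 1;
        for (int i = from + 1; i < to - 1; i++) {
            if (similarities[i] < similarities[cut - 1]) {
                cut = i + 1;
            }
        }
        splitRecursively(sentences, similarities, from, cut, maxChars, out);
        splitRecursively(sentences, similarities, cut, to, maxChars, out);
    }
}
```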

Another impression I have is that a good document parser, where you split at section boundaries (chapter 1, section 2, paragraph 3, etc.), gives better boundaries, as those are the boundaries the author actually decided on for the document. But I can imagine that for something like a novel you can't apply such a structured section approach, as novels are often just split into big chapters and, at best, small paragraphs. Maybe it's for that kind of document that semantic chunking works better, I don't know.

Somehow, we'd need some kind of benchmark to be able to measure more scientifically which strategy is best.

@fedecompa

@glaforge Yes, a good semantic chunker should probably first use a Hierarchical Document Splitter based on the document's intrinsic structure and then apply the semantic strategy only to overly long paragraphs (greater than maxSegmentSizeInTokens=max_position_embeddings).
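
As a sketch of that composition (plain functions instead of any specific library types, so the names here are just placeholders):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToIntFunction;

// Two-stage splitting: structural boundaries first, and only hand overly long sections
// to the (more expensive) semantic splitter.
public class HierarchicalThenSemanticSplitter {

    public static List<String> split(String document,
                                     Function<String, List<String>> structuralSplit,
                                     Function<String, List<String>> semanticSplit,
                                     ToIntFunction<String> tokenCount,
                                     int maxSegmentSizeInTokens) {
        List<String> segments = new ArrayList<>();
        for (String section : structuralSplit.apply(document)) {
            if (tokenCount.applyAsInt(section) <= maxSegmentSizeInTokens) {
                segments.add(section);                          // fits the embedding model as-is
            } else {
                segments.addAll(semanticSplit.apply(section));  // refine only the long sections
            }
        }
        return segments;
    }
}
```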
