
[FEATURE] Support more chunking strategies #1081

Open
orpiske opened this issue May 10, 2024 · 11 comments
Labels
enhancement New feature or request P2 High priority RAG

Comments

@orpiske

orpiske commented May 10, 2024

Is your feature request related to a problem? Please describe.

One of the lessons we learned from a project we worked on recently was that there doesn't seem to be great or widespread support for chunking in Java. We were particularly looking for support for different chunking strategies, which could have helped us maximize our ability to store, retrieve, and match data in our vector DB.

Describe the solution you'd like

We would like to discuss with the Langchain4j community whether adding support for chunking is feasible within this project and aligned with its goals and feature set.

Describe alternatives you've considered

Among other things, we have considered creating a chunking library as a separate project, but we believe that adding a chunking library as part of Langchain4j would result in a better developer experience and would also make it easier for the project to implement strategies that involve LLM-based chunking.

Additional context

If the community believes that this is in line with the project, we are motivated to contribute and help maintain this feature.

@orpiske orpiske added the enhancement New feature or request label May 10, 2024
@langchain4j
Owner

Hi @orpiske, this sounds great! We are looking forward to improving this in LC4J, so any contributions are welcome!
Apart from the existing document splitters, we plan to add a Markdown splitter and a semantic splitter in the near future.

What chunking strategies do you have in mind?

@orpiske
Author

orpiske commented May 10, 2024

It's great news that you have plans for a Markdown (and/or AsciiDoc) splitter and a semantic splitter! Those would have been very useful for our project.

In general, specialized chunkers/splitters (and/or an interface for implementing them) would be particularly helpful for a subset of our data (e.g. so we could deal with YAML, XML, etc.).

I also think that API/service/LLM-based chunking, where we defer the chunking to an external service, could be useful.

@langchain4j
Owner

The interface you are looking for is DocumentSplitter; please take a look at it and its existing implementations.
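
For illustration, a minimal custom implementation could look roughly like this (just a sketch: it assumes the split(Document) method of DocumentSplitter, and the YAML handling is deliberately naive):

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;

import java.util.ArrayList;
import java.util.List;

// Hypothetical YAML-aware splitter: cuts on top-level "---" document separators.
public class YamlDocumentSplitter implements DocumentSplitter {

    @Override
    public List<TextSegment> split(Document document) {
        List<TextSegment> segments = new ArrayList<>();
        // Naive split on YAML document separators; a real implementation would use a YAML parser
        // and would also propagate the document metadata onto each segment.
        for (String part : document.text().split("(?m)^---\\s*$")) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                segments.add(TextSegment.from(trimmed));
            }
        }
        return segments;
    }
}
```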

@orpiske
Author

orpiske commented May 10, 2024

The interface you are looking for is DocumentSplitter; please take a look at it and its existing implementations.

Noted, thanks!

@langchain4j langchain4j added the P3 Medium priority label May 13, 2024
@langchain4j langchain4j changed the title [FEATURE] Support for chunking strategies [FEATURE] Support more chunking strategies Jul 4, 2024
@langchain4j langchain4j added P2 High priority RAG and removed P3 Medium priority labels Oct 22, 2024
@fedecompa

For example, this one would be very useful:
https://python.langchain.com/v0.2/docs/how_to/semantic-chunker/

@glaforge
Collaborator

@orpiske @fedecompa I was thinking of contributing the ones I presented in my recent Advanced RAG talk:

  • Sliding window parent/child chunking (you embed a sentence but store/return the surrounding sentences; see the sketch below)
  • Hypothetical questions
  • Contextual retrieval (from Anthropic)
  • Semantic chunking (from Greg Kamradt) with a sliding window of sentences
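
For the first one, the idea is roughly this (a framework-agnostic sketch; the sentence splitting is deliberately naive and the class/record names are just placeholders):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sliding-window parent/child chunking: the "child" is the single sentence you embed,
// the "parent" is the window of surrounding sentences you store and return at retrieval time.
public class SlidingWindowChunker {

    public record Chunk(String textToEmbed, String textToReturn) {}

    public List<Chunk> chunk(String text, int windowSize) {
        // Naive sentence split; a real implementation would use java.text.BreakIterator or similar.
        String[] sentences = text.split("(?<=[.!?])\\s+");
        List<Chunk> chunks = new ArrayList<>();
        for (int i = 0; i < sentences.length; i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(sentences.length, i + windowSize + 1);
            String window = String.join(" ", Arrays.copyOfRange(sentences, from, to));
            chunks.add(new Chunk(sentences[i], window));
        }
        return chunks;
    }
}
```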

@fedecompa

@glaforge I am implementing a SemanticChunker class myself. Based on your experience, what is the optimal semantic similarity threshold? I have tried using a default value of 0.85.

@glaforge
Collaborator

The approach I was taking was more to find the lowest similarities (by ordering them) and decide how many breakpoints I want: https://github.com/datastaxdevs/conference-2024-devoxx/blob/main/devoxx-rag-naive-to-advanced/src/test/java/devoxx/rag/_3_advanced_rag_ingestion/_39_semantic_chunking.java

There's another thing to keep in mind: you could also end up with a big chunk that is longer than the size the embedding model can accept. So maybe some refinement is needed to further split such a big chunk, looking for the lowest similarities within that chunk alone.

One more thought... sometimes you may have a sentence somewhere whose content is really different from the rest of the paragraph it's in, for example a joke, or a remark pointing at some related topic, etc. To avoid having breakpoints there, I'm measuring similarity between sliding windows of sentences rather than at a single sentence boundary.

One more refinement: should we have some overlap, to ensure we're not missing useful context surrounding the breakpoint?
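
To illustrate, the breakpoint selection itself boils down to something like this (a simplified sketch, not the exact code from the repo linked above; sentence embeddings are assumed to be precomputed with whatever embedding model you use):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Pick the k boundaries with the lowest similarity between neighbouring sentences as breakpoints.
// To avoid breaking on a single odd sentence (a joke, an aside), you would average each side over
// a small sliding window of sentences instead of comparing single sentences.
public class SemanticBreakpoints {

    record Boundary(int index, double similarity) {}

    public static List<Integer> lowestSimilarityBreakpoints(List<float[]> sentenceEmbeddings, int breakpointCount) {
        List<Boundary> boundaries = new ArrayList<>();
        for (int i = 0; i < sentenceEmbeddings.size() - 1; i++) {
            boundaries.add(new Boundary(i + 1, cosine(sentenceEmbeddings.get(i), sentenceEmbeddings.get(i + 1))));
        }
        return boundaries.stream()
                .sorted(Comparator.comparingDouble(Boundary::similarity)) // lowest similarity first
                .limit(breakpointCount)
                .map(Boundary::index)
                .sorted()
                .toList();
    }

    static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```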

To be honest, my impression is that semantic chunking doesn't really give such good splits.
Also, have a look at this paper, which seems to indicate it's not necessarily doing that great:
https://arxiv.org/abs/2410.13070

@fedecompa

fedecompa commented Dec 18, 2024

@glaforge I have reached the same conclusions. I also had to deal with the fact that you can end up with a big chunk that is longer than the size supported by the embedding model. Furthermore, I can confirm the conclusions of the paper: this chunking strategy has high computational costs.

@glaforge
Collaborator

@fedecompa in my experience, one of the best chunking strategies I've tested was to use sliding windows of sentences: you split by sentences, you embed each sentence, but you return the surrounding sentences with it. It worked great for me.

But back to semantic chunking: maybe we can tweak it with some of the things we've already discussed, like max segment size and sliding windows. And instead of looking purely at a threshold, using a dichotomy search would help accommodate the max segment size (i.e. split in two at the lowest-similarity breakpoint, then look at both sides and, if needed, split again at the next lowest-similarity breakpoint, etc.).
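
Roughly like this (a sketch of the idea only; adjacentSimilarities[i] would be the precomputed similarity between sentence i and i+1, and segment size is approximated in characters rather than tokens):

```java
import java.util.ArrayList;
import java.util.List;

// Dichotomy-style semantic splitting: if a range of sentences is too big, cut it at the
// lowest-similarity boundary inside the range and recurse on both halves.
public class RecursiveSemanticSplitter {

    public static List<List<String>> split(List<String> sentences, double[] adjacentSimilarities, int maxChars) {
        List<List<String>> out = new ArrayList<>();
        splitRecursively(sentences, adjacentSimilarities, 0, sentences.size(), maxChars, out);
        return out;
    }

    private static void splitRecursively(List<String> sentences, double[] similarities,
                                         int from, int to, int maxChars, List<List<String>> out) {
        int totalChars = 0;
        for (int i = from; i < to; i++) {
            totalChars += sentences.get(i).length();
        }
        if (totalChars <= maxChars || to - from < 2) {
            out.add(new ArrayList<>(sentences.subList(from, to)));
            return;
        }
        // Cut at the boundary with the lowest similarity inside this range, then refine each half.
        int cut = from + 1;
        for (int i = from + 1; i < to - 1; i++) {
            if (similarities[i] < similarities[cut - 1]) {
                cut = i + 1;
            }
        }
        splitRecursively(sentences, similarities, from, cut, maxChars, out);
        splitRecursively(sentences, similarities, cut, to, maxChars, out);
    }
}
```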

Another impression I have is that a good document parser, where you split at section boundaries (chapter 1, section 2, paragraph 3, etc.), gives better boundaries, as those are the boundaries the author actually decided on for the document. But I can imagine that for something like a novel you can't apply such a structured section approach, as novels are often just split into big chapters and, at best, small paragraphs. Maybe it's for that kind of document that semantic chunking works better, I don't know.

Somehow, we'd need some kind of benchmark to be able to measure more scientifically which strategy is best.

@fedecompa

@glaforge Yes, a good semantic chunker should probably first use a Hierarchical Document Splitter based on the document's intrinsic structure and then apply the semantic strategy only to overly long paragraphs (greater than maxSegmentSizeInTokens=max_position_embeddings).
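
As a sketch of that composition (plain functions instead of any specific library types, so the names here are just placeholders):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;
import java.util.function.ToIntFunction;

// Two-stage splitting: structural boundaries first, and only hand overly long sections
// to the (more expensive) semantic splitter.
public class HierarchicalThenSemanticSplitter {

    public static List<String> split(String document,
                                     Function<String, List<String>> structuralSplit,
                                     Function<String, List<String>> semanticSplit,
                                     ToIntFunction<String> tokenCount,
                                     int maxSegmentSizeInTokens) {
        List<String> segments = new ArrayList<>();
        for (String section : structuralSplit.apply(document)) {
            if (tokenCount.applyAsInt(section) <= maxSegmentSizeInTokens) {
                segments.add(section);                          // fits the embedding model as-is
            } else {
                segments.addAll(semanticSplit.apply(section));  // refine only the long sections
            }
        }
        return segments;
    }
}
```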
