[FEATURE] Support more chunking strategies #1081
Comments
Hi @orpiske, this sounds great! We are looking forward to improving this in LC4J, so any contributions are welcome! What chunking strategies do you have in mind?
It's great news that you have plans for Markdown (and/or AsciiDoc) and a semantic splitter! Those would have been very useful for our project. In general, specialized chunkers/splitters (and/or an interface for implementing them) could be particularly helpful for a subset of our data (e.g., so we could deal with YAML, XML, etc.). I also think that API/service/LLM-based chunking, where we defer the chunking to an external service, could be useful.
The interface you are looking for is
Noted, thanks!
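As an illustration only (the exact interface reference above did not survive in this thread, so this sketch assumes LangChain4j's `DocumentSplitter` is the one meant, and the YAML handling is purely hypothetical), a custom format-aware splitter could look roughly like this:

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;

import java.util.ArrayList;
import java.util.List;

// Illustrative only: a format-aware splitter that cuts a YAML document at
// top-level keys, so each segment stays a self-contained block.
// Metadata propagation is omitted for brevity.
public class YamlTopLevelKeySplitter implements DocumentSplitter {

    @Override
    public List<TextSegment> split(Document document) {
        List<TextSegment> segments = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        for (String line : document.text().split("\n", -1)) {
            // A new top-level key starts at column 0 and is not a comment.
            boolean newTopLevelKey = !line.isEmpty()
                    && !Character.isWhitespace(line.charAt(0))
                    && !line.startsWith("#");
            if (newTopLevelKey && current.length() > 0) {
                segments.add(TextSegment.from(current.toString()));
                current.setLength(0);
            }
            current.append(line).append('\n');
        }
        if (current.length() > 0) {
            segments.add(TextSegment.from(current.toString()));
        }
        return segments;
    }
}
```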
For example this one would be very useful:
@orpiske @fedecompa I was thinking of contributing the ones I presented in my recent Advanced RAG talk:
@glaforge I am implementing a SemanticChunker class myself. Based on your experience, what is the optimal semantic similarity threshold? I have tried using a default value of 0.85.
The approach I was taking was more to find the lowest similarities (by ordering them) and to decide how many breakpoints I want: https://github.com/datastaxdevs/conference-2024-devoxx/blob/main/devoxx-rag-naive-to-advanced/src/test/java/devoxx/rag/_3_advanced_rag_ingestion/_39_semantic_chunking.java

Another thing to keep in mind is that you could still end up with a chunk that is longer than the size the embedding model can accept. So some refinement may be needed to further split a big chunk, by finding the lowest similarities within that chunk alone.

One more thought: sometimes you may have a sentence whose content is really different from the rest of the paragraph it's in, for example a joke, or a remark pointing at some related topic. To avoid placing breakpoints there, I measure similarity between sliding windows of sentences rather than at the single-sentence boundary.

One more refinement: should we add some overlap, to ensure we're not missing useful context surrounding the breakpoint?

To be honest, my impression is that semantic chunking doesn't really give such good splits.
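To make the "rank the similarities and keep only the lowest ones as breakpoints" idea above concrete, here is a minimal sketch (it is not the code from the linked Devoxx repository). It assumes the caller has already split the text into sentences and provides a LangChain4j `EmbeddingModel`; the number of breakpoints is chosen up front, as described above.

```java
import dev.langchain4j.data.embedding.Embedding;
import dev.langchain4j.model.embedding.EmbeddingModel;

import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Sketch of "pick the N weakest boundaries" semantic chunking:
// embed each sentence, rank the boundaries between adjacent sentences
// by cosine similarity, and split at the numBreakpoints lowest ones.
public class LowestSimilarityChunker {

    private final EmbeddingModel embeddingModel;

    public LowestSimilarityChunker(EmbeddingModel embeddingModel) {
        this.embeddingModel = embeddingModel;
    }

    public List<String> chunk(List<String> sentences, int numBreakpoints) {
        List<Embedding> embeddings = new ArrayList<>();
        for (String sentence : sentences) {
            embeddings.add(embeddingModel.embed(sentence).content());
        }

        // Boundary i sits between sentence i and sentence i + 1.
        List<Integer> boundaries = new ArrayList<>();
        for (int i = 0; i < sentences.size() - 1; i++) {
            boundaries.add(i);
        }
        boundaries.sort(Comparator.comparingDouble(i ->
                cosine(embeddings.get(i).vector(), embeddings.get(i + 1).vector())));

        // Keep only the weakest boundaries, then restore document order.
        List<Integer> breakpoints = new ArrayList<>(
                boundaries.subList(0, Math.min(numBreakpoints, boundaries.size())));
        Collections.sort(breakpoints);

        List<String> chunks = new ArrayList<>();
        int start = 0;
        for (int breakpoint : breakpoints) {
            chunks.add(String.join(" ", sentences.subList(start, breakpoint + 1)));
            start = breakpoint + 1;
        }
        chunks.add(String.join(" ", sentences.subList(start, sentences.size())));
        return chunks;
    }

    private static double cosine(float[] a, float[] b) {
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }
}
```

The sliding-window refinement mentioned above would simply replace each single-sentence embedding with the embedding of a small window of sentences around the boundary; the breakpoint selection stays the same.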
@glaforge I have reached the same conclusions. I also had to deal with the fact that there can be a chunk somewhere that is longer than the size supported by the embedding model. Furthermore, I can confirm the conclusions of the paper: this chunking strategy has high computational costs.
@fedecompa in my experience, one of the best chunking strategies I've tested was sliding windows of sentences: you split by sentences and embed each sentence, but you return the surrounding sentences with it. It worked great for me.

But back to semantic chunking, maybe we can tweak it with some of the things we've already discussed, like a max segment size and sliding windows. And instead of looking purely at a threshold, a dichotomy search would help accommodate the max segment size (i.e., split in two at the lowest-similarity breakpoint, then look at both sides and split again at the next lowest-similarity breakpoint if needed, and so on).

Another impression I have is that a good document parser, where you split at section boundaries (chapter 1, section 2, paragraph 3, etc.), gives better boundaries, because those are the boundaries the author actually chose for the document. But I can imagine that for something like a novel you can't apply such a structured section approach, as novels are often just split into big chapters and, at best, small paragraphs. It's maybe for that kind of document that semantic chunking is better, I don't know.

Somehow, we'd need some kind of benchmark to be able to measure more scientifically which strategy is best.
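A minimal sketch of that sentence-window idea (not an existing LC4J splitter): the caller supplies pre-split sentences; each returned `TextSegment` carries the surrounding sentences as its text, while the central sentence is kept under a hypothetical `"sentence"` metadata key so the caller can embed just that sentence at ingestion time.

```java
import dev.langchain4j.data.document.Metadata;
import dev.langchain4j.data.segment.TextSegment;

import java.util.ArrayList;
import java.util.List;

// Sketch of a "sentence window" strategy: the embedding is meant to be computed
// from the central sentence (kept under a hypothetical "sentence" metadata key),
// while the segment text returned to the LLM includes the surrounding sentences.
public class SentenceWindowChunker {

    public List<TextSegment> chunk(List<String> sentences, int windowSize) {
        List<TextSegment> segments = new ArrayList<>();
        for (int i = 0; i < sentences.size(); i++) {
            int from = Math.max(0, i - windowSize);
            int to = Math.min(sentences.size(), i + windowSize + 1);
            String window = String.join(" ", sentences.subList(from, to));
            Metadata metadata = Metadata.from("sentence", sentences.get(i));
            segments.add(TextSegment.from(window, metadata));
        }
        return segments;
    }
}
```

At ingestion time the caller would embed the central sentence but store the windowed text, so retrieval matches on a precise sentence while the LLM still sees its surrounding context.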
@glaforge Yes, a good semantic chunker should probably first use a Hierarchical Document Splitter based on the document's intrinsic structure and then apply the semantic strategy only to overly long paragraphs (greater than maxSegmentSizeInTokens=max_position_embeddings).
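A rough sketch of that two-stage idea, combining a caller-supplied structural splitter with the semantic chunker sketched earlier; the sentence splitting, token estimate, and breakpoint count below are illustrative assumptions, not settled API:

```java
import dev.langchain4j.data.document.Document;
import dev.langchain4j.data.document.DocumentSplitter;
import dev.langchain4j.data.segment.TextSegment;

import java.util.ArrayList;
import java.util.List;

// Two-stage chunking: split along the document's own structure first,
// then apply semantic chunking only to segments that exceed the model's limit.
public class StructureFirstSemanticSplitter implements DocumentSplitter {

    private final DocumentSplitter structuralSplitter;     // e.g. a paragraph/section splitter
    private final LowestSimilarityChunker semanticChunker; // from the earlier sketch
    private final int maxSegmentSizeInTokens;

    public StructureFirstSemanticSplitter(DocumentSplitter structuralSplitter,
                                          LowestSimilarityChunker semanticChunker,
                                          int maxSegmentSizeInTokens) {
        this.structuralSplitter = structuralSplitter;
        this.semanticChunker = semanticChunker;
        this.maxSegmentSizeInTokens = maxSegmentSizeInTokens;
    }

    @Override
    public List<TextSegment> split(Document document) {
        List<TextSegment> result = new ArrayList<>();
        for (TextSegment segment : structuralSplitter.split(document)) {
            if (estimateTokens(segment.text()) <= maxSegmentSizeInTokens) {
                result.add(segment);
            } else {
                // Only overly long structural segments pay the cost of semantic chunking.
                List<String> sentences = List.of(segment.text().split("(?<=[.!?])\\s+"));
                int breakpoints = estimateTokens(segment.text()) / maxSegmentSizeInTokens;
                for (String chunk : semanticChunker.chunk(sentences, breakpoints)) {
                    result.add(TextSegment.from(chunk));
                }
            }
        }
        return result;
    }

    // Crude token estimate; a real implementation would use the embedding model's tokenizer.
    private static int estimateTokens(String text) {
        return text.length() / 4;
    }
}
```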
Is your feature request related to a problem? Please describe.
One of the lessons we learned from a project we worked on recently was that there doesn't seem to be great or widespread support for chunking in Java. We were particularly looking for support for different chunking strategies, which could have helped us maximize our ability to store, retrieve, and match data in our vector DB.
Describe the solution you'd like
We would like to discuss with the Langchain4j community whether they consider having support for chunking feasible within this project and aligned with the project's goals and feature set.
Describe alternatives you've considered
Among other things, we have considered creating a chunking library as a separate project, but we believe that adding a chunking library as part of Langchain4j would result in a better developer experience and would also make it easier for the project to implement chunking strategies that involve LLM-based chunking.
Additional context
If the community believes that this is in line with the project, we are motivated to contribute and help maintain this feature.