Releases: chonkie-ai/chonkie
v0.3.0
Highlights
- Added `LateChunker` support! You can use `LateChunker` in the following manner (a usage sketch also follows this list):
```python
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence",
    trust_remote_code=True
)
```
- Added Chonkie Discord to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on Twitter and Bluesky too!
- Bunch of bug fixes to improve chunkers' stability...
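A minimal usage sketch for the `LateChunker` constructed above, assuming the returned chunks expose the usual `text`, `start_index`, and `end_index` fields; the sample text is illustrative:

```python
# Chunk a document with the LateChunker constructed in the snippet above.
text = (
    "Late chunking embeds the whole document first, then derives each chunk's "
    "embedding from the token embeddings that fall inside that chunk."
)
chunks = chunker.chunk(text)

for chunk in chunks:
    # Each chunk carries its text span and character offsets into the original document.
    print(chunk.start_index, chunk.end_index, chunk.text[:40])
```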
What's Changed
- [Fix] #37: Incorrect indexing when repetition is present in the text by @bhavnicksm in #87
- [Fix] #88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by @arpesenti in #89
- [Fix] WordChunker chunk_batch fail by @sky-2002 in #90
- [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by @bhavnicksm in #96
- Add initial support for Late Chunking by @bhavnicksm in #97
- [FEAT] Add LateChunker by @bhavnicksm in #98
- [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by @bhavnicksm in #99
- Update version to 0.3.0 in pyproject.toml and init.py by @bhavnicksm in #100
- [fix] Add LateChunker support to chunker and module exports by @bhavnicksm in #101
- [fix] Docstrings in SemanticChunker should include **kwargs by @bhavnicksm in #102
- [Minor] Add Discord badge to README for community engagement by @bhavnicksm in #103
New Contributors
- @arpesenti made their first contribution in #89
Full Changelog: v0.2.2...v0.3.0
v0.2.2
Highlights
- Added Token Estimate-Validate Loops (TEVL) inside the SentenceChunker for speed-ups of up to ~5x at times
- Added an `auto` thresholding mode for SemanticChunkers to remove the hard requirement on `similarity_threshold`. SemanticChunkers can now decide on their own threshold, based on the minimum and maximum. (See the sketch after this list.)
- Added `OverlapRefinery` for adding overlap context to the chunks. The `chunk_overlap` parameter will be deprecated in the future in favor of `OverlapRefinery` instead.
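A minimal sketch of the two highlights above. Note the hedges: the `similarity_threshold="auto"` value, the embedding model name, and the `OverlapRefinery` parameter and method names are assumptions here, so check the docs for the exact signatures:

```python
from chonkie import SemanticChunker, OverlapRefinery

# Let the chunker pick its own threshold instead of passing a fixed similarity_threshold.
chunker = SemanticChunker(
    embedding_model="minishlab/M2V_base_output",  # illustrative Model2Vec model name
    chunk_size=512,
    similarity_threshold="auto",  # assumption: "auto" enables the statistical thresholding mode
)
chunks = chunker.chunk("Some long document text ...")

# Add overlap context to the chunks with the new OverlapRefinery instead of chunk_overlap.
refinery = OverlapRefinery(context_size=64)    # assumed parameter name
chunks_with_context = refinery.refine(chunks)  # assumed method name
```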
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
- [Update] Change default embedding model in SemanticChunkers by @bhavnicksm in #63
- Add `min_chunk_size` to SDPMChunker + Lint codebase with ruff + minor changes by @bhavnicksm in #68
- Added automated testing using Github Actions by @pratyushmittal in #66
- Add support for automated testing with Github Actions by @bhavnicksm in #69
- [Fix] Allow for functions as token_counters in BaseChunkers by @bhavnicksm in #70
- Add TEVL to speed up sentence chunker by @bhavnicksm in #71
- Add TEVL to speed-up sentence chunking by @bhavnicksm in #72
- Update the docs path to docs.chonkie.ai by @bhavnicksm in #75
- [FEAT] Add BaseRefinery and OverlapRefinery support by @bhavnicksm in #77
- Add support for BaseRefinery and OverlapRefinery + minor changes by @bhavnicksm in #78
- [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by @bhavnicksm in #79
- [Fix] Unify dataclasses under a types.py for ease by @bhavnicksm in #80
- Expose the separation delim for simple multilingual chunking by @bhavnicksm in #81
- Bump version to v0.2.2 for release by @bhavnicksm in #82
New Contributors
- @pratyushmittal made their first contribution in #66
Full Changelog: v0.2.1...v0.2.2
v0.2.1.post1
Highlights
This patch allows AutoEmbeddings to properly default to `SentenceTransformerEmbeddings`, which was being bypassed in the previous release.
Furthermore, because of reconstructable splitting, numerous smaller sentences were making it through to the SemanticChunker. To address the issue, this fix introduces a `min_chunk_size` parameter, which takes the minimum number of tokens that need to be in a chunk. This resolves the issues seen in the tests.
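A minimal sketch of how `min_chunk_size` might be passed; the embedding model name and the other values are illustrative:

```python
from chonkie import SemanticChunker

chunker = SemanticChunker(
    embedding_model="minishlab/M2V_base_output",  # illustrative model name
    chunk_size=512,
    similarity_threshold=0.7,  # illustrative; still an explicit value in this release
    min_chunk_size=2,          # new: minimum number of tokens a chunk must contain
)
```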
What's Changed
- [Fix] AutoEmbeddings not loading `all-minilm-l6-v2` but loads `All-MiniLM-L6-V2` by @bhavnicksm in #57
- [Fix] Add fix for #55 by @bhavnicksm in #58
- [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
- [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
Full Changelog: v0.2.1...v0.2.1.post1
v0.2.1
Breaking Changes
- SemanticChunker no longer accepts SentenceTransformer models directly; instead, this release uses the `SentenceTransformerEmbeddings` class, which can take in a model directly. Future releases will add the functionality to auto-detect and create embeddings inside the `AutoEmbeddings` class.
- By default, the `semantic` optional installation now depends on `Model2VecEmbeddings` and hence the `model2vec` Python package from this release onwards, due to size and speed benefits. `Model2Vec` uses static embeddings, which are good enough for the task of chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
- `SemanticChunker` and `SDPMChunker` now use the argument `chunk_size` instead of `max_chunk_size` for uniformity across the chunkers, but the internal representation remains the same. (A sketch of the new usage follows this list.)
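A minimal sketch of the new usage, per the notes above; the `chonkie.embeddings` import path and the threshold value are assumptions:

```python
from sentence_transformers import SentenceTransformer
from chonkie import SemanticChunker
from chonkie.embeddings import SentenceTransformerEmbeddings  # assumed import path

# Before: SemanticChunker(embedding_model=SentenceTransformer(...), max_chunk_size=512)
# Now: wrap the model in SentenceTransformerEmbeddings and use chunk_size instead.
embeddings = SentenceTransformerEmbeddings(SentenceTransformer("all-MiniLM-L6-v2"))
chunker = SemanticChunker(
    embedding_model=embeddings,
    chunk_size=512,            # renamed from max_chunk_size
    similarity_threshold=0.7,  # illustrative value
)
```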
What's Changed
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
- Use `__slots__` instead of `slots=True` for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
- [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
- Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
- Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
- [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
- [FEAT] - Add model2vec embedding models by @sky-2002 in #41
- [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
- [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
- [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
- [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
- [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
- @sky-2002 made their first contribution in #41
Full Changelog: v0.2.0...v0.2.1
v0.2.0.post1
Highlights
This patch was added to fix support for Python 3.9 with dataclass slots. Earlier we were using `slots=True`, which only works on Python 3.10 onwards. The `__slots__` approach used now also works on Python 3.10+ versions.
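For context, a minimal sketch of the difference; the `Chunk` fields shown here are illustrative:

```python
from dataclasses import dataclass

# Python 3.10+ only: the slots flag was added to dataclass() in 3.10.
# @dataclass(slots=True)
# class Chunk:
#     text: str

# Works on Python 3.9 as well: declare __slots__ by hand on a regular dataclass.
@dataclass
class Chunk:
    __slots__ = ("text", "start_index", "end_index", "token_count")
    text: str
    start_index: int
    end_index: int
    token_count: int
```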
What's Changed
- Use `__slots__` instead of `slots=True` for python3.9 support by @bhavnicksm in #34
- Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
Full Changelog: v0.2.0...v0.2.0.post1
v0.2.0
Breaking Changes
- Semantic Chunkers no longer take in an additional `tokenizer` object; they instead infer the tokenizer from the `embedding_model` passed. This is done to ensure that the `tokenizer` and `embedding_model` token counts always match, as that is a necessary condition for some of the optimizations on them. (A sketch of the change follows this list.)
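A minimal sketch of the change, using illustrative model and parameter values (`max_chunk_size` was still the argument name in this release; it was renamed to `chunk_size` in v0.2.1):

```python
from chonkie import SemanticChunker

# Before v0.2.0: a separate tokenizer could be passed alongside the embedding model,
# which could leave the chunker counting tokens differently from the embedding model.
# chunker = SemanticChunker(tokenizer="gpt2", embedding_model="all-MiniLM-L6-v2", ...)

# From v0.2.0 onwards: only the embedding model is passed; its own tokenizer is used
# for token counts, so the two can never disagree.
chunker = SemanticChunker(
    embedding_model="all-MiniLM-L6-v2",
    max_chunk_size=512,
    similarity_threshold=0.7,  # illustrative value
)
```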
What's Changed
- Update Docs by @bhavnicksm in #14
- Update acknowledgements in README.md for improved clarity and appreci… by @bhavnicksm in #15
- Update README.md + fix DOCS.md typo by @bhavnicksm in #17
- Remove Spacy dependency from Chonkie by @bhavnicksm in #20
- Remove Spacy dependency from 'sentence' install + Add FAQ to DOCS.md by @bhavnicksm in #21
- Update README.md + minor updates by @bhavnicksm in #22
- fix: tokenizer mismatch for `SemanticChunker` + Add BaseEmbeddings by @bhavnicksm in #24
- Update dependency version of SentenceTransformer to at least 2.3.0 by @bhavnicksm in #27
- Add initial batching support via `chunk_batch` fn + update DOCS by @bhavnicksm in #28
- [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
- [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
- Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
New Contributors
- @mrmps made their first contribution in #29
- @jasonacox made their first contribution in #30
Full Changelog: v0.1.2...v0.2.0
v0.1.2
What's Changed
- Make imports as a part of Chunker init instead of file imports to make Chonkie import faster by @bhavnicksm in #12
- Run Black + Isort + beautify the code a bit by @bhavnicksm in #13
Full Changelog: v0.1.1...v0.1.2
v0.1.1
What's Changed
- Update README.md by @bhavnicksm in #10
- Bump version to 0.1.1 in pyproject.toml and init.py by @bhavnicksm in #11
Full Changelog: v0.1.0...v0.1.1
v0.1.0
What's Changed
- Disentangle the Embedding Model from SemanticChunker + Update DOCS and README by @bhavnicksm in #9
Full Changelog: v0.0.3...v0.1.0
v0.0.3
What's Changed
- Bump version to 0.0.2 in pyproject.toml and init.py for release by @bhavnicksm in #6
- Update README.md + remove .github action by @bhavnicksm in #7
- Bump version to 0.0.3 in pyproject.toml and init.py for release by @bhavnicksm in #8
Full Changelog: v0.0.2...v0.0.3