Releases: chonkie-ai/chonkie

v0.3.0

23 Dec 18:44
345ca6c

Highlights

  • Added LateChunker support! You can use LateChunker in the following manner (a usage sketch follows this list):
from chonkie import LateChunker

chunker = LateChunker(
    embedding_model="jinaai/jina-embeddings-v3",
    mode="sentence", 
    trust_remote_code=True
)
  • Added Chonkie Discord to the repository~ Join now to connect with the community! Oh, btw, Chonkie is now on Twitter and Bluesky too!
  • A bunch of bug fixes to improve the chunkers' stability...
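
Continuing the LateChunker snippet above, here is a rough usage sketch. The chunk() call and the chunk fields shown mirror the interface of the other chunkers and are assumptions for LateChunker, so verify them against the docs:

# Assumes `chunker` from the snippet above has been constructed.
text = "Chonkie chunks text so you don't have to. It keeps each chunk coherent."
chunks = chunker.chunk(text)

for chunk in chunks:
    # Assumed fields: chunk text, character span, and token count.
    print(chunk.text, chunk.start_index, chunk.end_index, chunk.token_count)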

What's Changed

  • [Fix] #37: Incorrect indexing when repetition is present in the text by @bhavnicksm in #87
  • [Fix] #88: SemanticChunker raises UnboundLocalError: local variable 'threshold' referenced before assignment by @arpesenti in #89
  • [Fix] WordChunker chunk_batch fail by @sky-2002 in #90
  • [FIX] MEGA Bug Fix PR: Fix WordChunker batching, Fix SentenceChunker token counts, Initialization + more by @bhavnicksm in #96
  • Add initial support for Late Chunking by @bhavnicksm in #97
  • [FEAT] Add LateChunker by @bhavnicksm in #98
  • [FIX] Update outdated package versions + set max limit to numpy to v2.2 (buggy) by @bhavnicksm in #99
  • Update version to 0.3.0 in pyproject.toml and init.py by @bhavnicksm in #100
  • [fix] Add LateChunker support to chunker and module exports by @bhavnicksm in #101
  • [fix] Docstrings in SemanticChunker should include **kwargs by @bhavnicksm in #102
  • [Minor] Add Discord badge to README for community engagement by @bhavnicksm in #103

Full Changelog: v0.2.2...v0.3.0

v0.2.2

06 Dec 22:56
475f08d

Highlights

  • Added Token Estimate-Validate Loops (TEVL) inside the SentenceChunker, giving speedups of up to ~5x in some cases
  • Added an auto-thresholding mode to SemanticChunker, removing the hard requirement to pass similarity_threshold. SemanticChunker can now decide on its own threshold based on the observed minimum and maximum similarity.
  • Added OverlapRefinery for adding overlapping context to chunks; the chunk_overlap parameter will be deprecated in a future release in favor of OverlapRefinery. A sketch of both features follows this list.
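
A minimal sketch of the auto threshold and OverlapRefinery together. The import path for OverlapRefinery, the context_size parameter name, the refine() method, and the idea that omitting similarity_threshold enables the auto mode are assumptions here rather than confirmed API details:

from chonkie import SemanticChunker, OverlapRefinery  # assumed import path for OverlapRefinery

# No similarity_threshold passed: the chunker is assumed to derive one
# statistically from the pairwise sentence similarities.
chunker = SemanticChunker(chunk_size=512)
chunks = chunker.chunk("Some long document text goes here...")

# Add overlapping context to neighbouring chunks after chunking.
refinery = OverlapRefinery(context_size=64)  # assumed parameter name
chunks = refinery.refine(chunks)             # assumed method name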

What's Changed

  • [Fix] AutoEmbeddings not loading all-minilm-l6-v2 but loads All-MiniLM-L6-V2 by @bhavnicksm in #57
  • [Fix] Add fix for #55 by @bhavnicksm in #58
  • [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
  • [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62
  • [Update] Change default embedding model in SemanticChunkers by @bhavnicksm in #63
  • Add min_chunk_size to SDPMChunker + Lint codebase with ruff + minor changes by @bhavnicksm in #68
  • Added automated testing using Github Actions by @pratyushmittal in #66
  • Add support for automated testing with Github Actions by @bhavnicksm in #69
  • [Fix] Allow for functions as token_counters in BaseChunkers by @bhavnicksm in #70
  • Add TEVL to speed up sentence chunker by @bhavnicksm in #71
  • Add TEVL to speed-up sentence chunking by @bhavnicksm in #72
  • Update the docs path to docs.chonkie.ai by @bhavnicksm in #75
  • [FEAT] Add BaseRefinery and OverlapRefinery support by @bhavnicksm in #77
  • Add support for BaseRefinery and OverlapRefinery + minor changes by @bhavnicksm in #78
  • [FEAT] Add "auto" threshold configuration via Statistical analysis in SemanticChunker + minor fixes by @bhavnicksm in #79
  • [Fix] Unify dataclasses under a types.py for ease by @bhavnicksm in #80
  • Expose the separation delim for simple multilingual chunking by @bhavnicksm in #81
  • Bump version to v0.2.2 for release by @bhavnicksm in #82

Full Changelog: v0.2.1...v0.2.2

v0.2.1.post1

24 Nov 14:41
7b1e480

Highlights

This patch allows AutoEmbeddings to properly default to SentenceTransformerEmbeddings, which was being bypassed in the previous release.

Furthermore, because of reconstructable splitting, numerous very short sentences were making it through to the SemanticChunker. To address this, the fix introduces a min_chunk_size parameter that sets the minimum number of tokens a chunk must contain, which resolves the failing tests. A rough sketch follows.
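
A rough sketch of min_chunk_size in use; the surrounding parameter names and values are illustrative assumptions, not the exact signature:

from chonkie import SemanticChunker

chunker = SemanticChunker(
    chunk_size=512,             # illustrative
    similarity_threshold=0.7,   # illustrative; still required at this release
    min_chunk_size=8,           # minimum number of tokens a chunk must contain
)
chunks = chunker.chunk("Short. Sentences. Like. These. No. Longer. Become. Tiny. Chunks.")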

What's Changed

  • [Fix] AutoEmbeddings not loading all-minilm-l6-v2 but loads All-MiniLM-L6-V2 by @bhavnicksm in #57
  • [Fix] Add fix for #55 by @bhavnicksm in #58
  • [Refactor] Add min_chunk_size parameter to SemanticChunker and SentenceChunker by @bhavnicksm in #60
  • [Update] Bump version to 0.2.1.post1 and require Python 3.9 or higher by @bhavnicksm in #62

Full Changelog: v0.2.1...v0.2.1.post1

v0.2.1

22 Nov 10:40
f5768e8

Breaking Changes

  • SemanticChunker no longer accepts SentenceTransformer models directly; instead, wrap the model in the new SentenceTransformerEmbeddings class, which takes a model directly (see the migration sketch after this list). Future releases will add the ability to auto-detect and create embeddings through the AutoEmbeddings class.
  • From this release onwards, the semantic optional install depends on Model2VecEmbeddings, and hence on the model2vec Python package, due to its size and speed benefits. Model2Vec uses static embeddings, which are good enough for chunking while being 10x faster than standard Sentence Transformers and a 10x lighter dependency.
  • SemanticChunker and SDPMChunker now use the argument chunk_size instead of max_chunk_size for uniformity across the chunkers; the internal representation remains the same.
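
A migration sketch for these changes; the chonkie.embeddings import path and the exact constructor signatures are assumptions:

from chonkie import SemanticChunker
from chonkie.embeddings import SentenceTransformerEmbeddings  # assumed import path

# Before: SemanticChunker(embedding_model=SentenceTransformer("all-MiniLM-L6-v2"),
#                         max_chunk_size=512, ...)
# After: wrap the model in the embeddings class and pass chunk_size instead.
embeddings = SentenceTransformerEmbeddings("all-MiniLM-L6-v2")
chunker = SemanticChunker(
    embedding_model=embeddings,
    chunk_size=512,
    similarity_threshold=0.7,
)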

What's Changed

  • [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
  • [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
  • Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32
  • Use __slots__ instead of slots=True for python3.9 support by @bhavnicksm in #34
  • Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35
  • [FEAT] Add SentenceTransformerEmbeddings, EmbeddingsRegistry and AutoEmbeddings provider support by @bhavnicksm in #44
  • Refactor BaseChunker, SemanticChunker and SDPMChunker to support BaseEmbeddings by @bhavnicksm in #45
  • Add initial OpenAIEmbeddings support to Chonkie ✨ by @bhavnicksm in #46
  • [DOCS] Add info about initial embeddings support and how to add custom embeddings by @bhavnicksm in #47
  • [FEAT] - Add model2vec embedding models by @sky-2002 in #41
  • [FEAT] Add support for Model2VecEmbeddings + Switch default embeddings to Model2VecEmbeddings by @bhavnicksm in #49
  • [fix] Reorganize optional dependencies in pyproject.toml: rename 'sem… by @bhavnicksm in #51
  • [Fix] Token counts from Tokenizers and Transformers adding special tokens by @bhavnicksm in #52
  • [Fix] Refactor WordChunker, SentenceChunker pre-chunk splitting for reconstruction tests + minor changes by @bhavnicksm in #53
  • [Refactor] Optimize similarity calculation by using np.divide for imp… by @bhavnicksm in #54

Full Changelog: v0.2.0...v0.2.1

v0.2.0.post1

18 Nov 09:32
6227b48

Highlights

This patch fixes support for Python 3.9 with dataclass slots. Earlier we were using slots=True, which only works on Python 3.10 onwards; switching to __slots__ works on Python 3.9 as well as 3.10+.
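
The difference in code, as an illustration rather than the actual Chonkie source:

from dataclasses import dataclass

# Python 3.10+ only:
# @dataclass(slots=True)
# class Chunk:
#     text: str

# Works on Python 3.9 as well as 3.10+:
@dataclass
class Chunk:
    __slots__ = ("text",)
    text: str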

What's Changed

  • Use __slots__ instead of slots=True for python3.9 support by @bhavnicksm in #34
  • Bump version to 0.2.0.post1 in pyproject.toml and init.py by @bhavnicksm in #35

Full Changelog: v0.2.0...v0.2.0.post1

v0.2.0

17 Nov 13:27
0dd5ecb

Breaking Changes

  • Semantic chunkers no longer take an additional tokenizer object; they instead infer the tokenizer from the embedding_model passed in. This ensures that the tokenizer and embedding model token counts always match, which is a necessary condition for some of the optimizations built on top of them (see the sketch below).
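
A sketch of the change; the parameter names and the string model shorthand are illustrative assumptions:

from chonkie import SemanticChunker

# Before (illustrative): tokenizer and embedding model were passed separately,
# so their token counts could disagree.
# chunker = SemanticChunker(tokenizer=tokenizer, embedding_model=model, max_chunk_size=512)

# After: pass only the embedding model; its own tokenizer is used for token counting.
chunker = SemanticChunker(
    embedding_model="all-MiniLM-L6-v2",
    max_chunk_size=512,
)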

What's Changed

  • Update Docs by @bhavnicksm in #14
  • Update acknowledgements in README.md for improved clarity and appreci… by @bhavnicksm in #15
  • Update README.md + fix DOCS.md typo by @bhavnicksm in #17
  • Remove Spacy dependency from Chonkie by @bhavnicksm in #20
  • Remove Spacy dependency from 'sentence' install + Add FAQ to DOCS.md by @bhavnicksm in #21
  • Update README.md + minor updates by @bhavnicksm in #22
  • fix: tokenizer mismatch for SemanticChunker + Add BaseEmbeddings by @bhavnicksm in #24
  • Update dependency version of SentenceTransformer to at least 2.3.0 by @bhavnicksm in #27
  • Add initial batching support via chunk_batch fn + update DOCS by @bhavnicksm in #28
  • [BUG] Fix the start_index and end_index to point to character indices, not token indices by @mrmps in #29
  • [DOCS] Fix typo for import tokenizer in quick start example by @jasonacox in #30
  • Major Update: Fix bugs + Update docs + Add slots to dataclasses + update word & sentence splitting logic + minor changes by @bhavnicksm in #32


Full Changelog: v0.1.2...v0.2.0.post1

v0.1.2

08 Nov 17:26
745e5e8

What's Changed

  • Make imports as a part of Chunker init instead of file imports to make Chonkie import faster by @bhavnicksm in #12
  • Run Black + Isort + beautify the code a bit by @bhavnicksm in #13

Full Changelog: v0.1.1...v0.1.2

v0.1.1

07 Nov 19:06
3b1fa22

What's Changed

Full Changelog: v0.1.0...v0.1.1

v0.1.0

07 Nov 18:46
977d1d6

What's Changed

  • Disentangle the Embedding Model from SemanticChunker + Update DOCS and README by @bhavnicksm in #9

Full Changelog: v0.0.3...v0.1.0

v0.0.3

06 Nov 15:41
68e3272

What's Changed

  • Bump version to 0.0.2 in pyproject.toml and init.py for release by @bhavnicksm in #6
  • Update README.md + remove .github action by @bhavnicksm in #7
  • Bump version to 0.0.3 in pyproject.toml and init.py for release by @bhavnicksm in #8

Full Changelog: v0.0.2...v0.0.3