
TokenChunker does not support multiple inputs #18

Closed
not-lain opened this issue Nov 10, 2024 · 5 comments

@not-lain

Issue

I ran the following example provided in the README file:

# First import the chunker you want from Chonkie 
from chonkie import TokenChunker

# Import your favorite tokenizer library
# Also supports AutoTokenizers, TikToken and AutoTikTokenizer
from tokenizers import Tokenizer 
tokenizer = Tokenizer.from_pretrained("gpt2")

# Initialize the chunker
chunker = TokenChunker(tokenizer)

# Chunk some text
chunks = chunker("Woah! Chonkie, the chunking library is so cool!",
                  "I love the tiny hippo hehe.")

# Access chunks
for chunk in chunks:
    print(f"Chunk: {chunk.text}")
    print(f"Tokens: {chunk.token_count}")

and I ran into the following error:

TypeError                                 Traceback (most recent call last)
<ipython-input-2-bb5f7fdb45bb> in <cell line: 13>()
     11 
     12 # Chunk some text
---> 13 chunks = chunker("Woah! Chonkie, the chunking library is so cool!","I love the tiny hippo hehe.")
     14 
     15 # Access chunks

TypeError: BaseChunker.__call__() takes 2 positional arguments but 3 were given
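
For what it's worth, passing a single string works as expected, so the error only shows up when a second positional argument is given:

# Single-input call, as the chunker currently expects
chunks = chunker("Woah! Chonkie, the chunking library is so cool!")
for chunk in chunks:
    print(chunk.text, chunk.token_count)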

Extra information

I would suggest either updating the example in the README file or updating BaseChunker to support multiple inputs at the same time.
The latter is my preferred option, since it would let us process multiple samples at once. We could support either lists or *args here, preferably lists, since the tokenizers library already supports them (see the example below).
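
For reference, this is roughly how the tokenizers library already handles a list of inputs (encode_batch is part of its public API):

from tokenizers import Tokenizer

tokenizer = Tokenizer.from_pretrained("gpt2")

# encode_batch accepts a list of texts and returns one Encoding per input
encodings = tokenizer.encode_batch([
    "Woah! Chonkie, the chunking library is so cool!",
    "I love the tiny hippo hehe.",
])
print([len(encoding.ids) for encoding in encodings])  # token count per input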

@not-lain added the bug label Nov 10, 2024
@bhavnicksm added the enhancement label Nov 10, 2024
@bhavnicksm
Collaborator

bhavnicksm commented Nov 10, 2024

Hey @not-lain,

WOAH 😳 that's a bit embarrassing, haha
I'll fix the example in the README.md, right now!

Regarding adding batching/list support, I plan to add multiprocessing support (via MPIRE) soon, so we can run these in parallel~! I'm going with multiprocessing because I want Chonkie to be the fastest even with batching.

Would really appreciate PRs if you're willing to work on this.

bhavnicksm added a commit that referenced this issue Nov 10, 2024
@not-lain
Author

On it 🫡

@not-lain self-assigned this Nov 10, 2024
@bhavnicksm
Collaborator

bhavnicksm commented Nov 10, 2024

Hey @not-lain!

We can probably add a method to the BaseChunker class, named chunk_batch that can run chunk via multiprocessing. So whenever we add new Chunkers in the future, we don't need to re-implement the chunk_batch function.

And we can expose num_proc as an optional parameter on chunk_batch.

How does that sound?
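
Roughly something like this (just a sketch of the idea, not the final implementation; the actual PR may use MPIRE or a different pool):

from multiprocessing import Pool
from typing import List, Optional

class BaseChunker:
    def chunk(self, text: str):
        """Chunk a single text; implemented by each concrete chunker."""
        raise NotImplementedError

    def chunk_batch(self, texts: List[str], num_proc: Optional[int] = None):
        """Chunk a list of texts, optionally spreading the work across processes."""
        if num_proc is None or num_proc <= 1:
            return [self.chunk(text) for text in texts]
        with Pool(processes=num_proc) as pool:
            return pool.map(self.chunk, texts)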

bhavnicksm added a commit that referenced this issue Nov 11, 2024
@bhavnicksm
Collaborator

Hey @not-lain,

Just added initial support for batching in the BaseChunker via Python's multiprocessing library in #28. This is definitely not the most optimal way to go about chunking, but it serves as a placeholder until we build more optimal chunking approaches.
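
Usage should look roughly like this (assuming the chunk_batch shape discussed above; check #28 for the exact signature):

texts = [
    "Woah! Chonkie, the chunking library is so cool!",
    "I love the tiny hippo hehe.",
]
# num_proc controls how many worker processes are used (shape as proposed above)
batch_chunks = chunker.chunk_batch(texts, num_proc=2)
for doc_chunks in batch_chunks:
    for chunk in doc_chunks:
        print(chunk.text, chunk.token_count)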

I'd be happy to accept PRs for "native" batching approaches in TokenChunker and the other chunkers that work without multiprocessing.

For now, I think we can close this issue and make different issues for "native" batching support on the various chunkers.

Thanks 😊

@not-lain
Author

Awesome, I was thinking of doing this over the weekend, but glad it was already implemented.
Nice work 🙌
