Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement chat continuation #68

Closed
abetlen opened this issue Apr 11, 2023 · 8 comments
Closed

Implement chat continuation #68

abetlen opened this issue Apr 11, 2023 · 8 comments
Labels
enhancement New feature or request high-priority

Comments

@abetlen
Copy link
Owner

abetlen commented Apr 11, 2023

As suggested by @MillionthOdin16 because implementing #44 is taking longer than expected we should add a simple form of chat continuation if the previous message history matches. ie

Request 1

{ "messages": [msg1, msg2] }

Response 1

{ "messages": [msg1, msg2, msg3] }

Request 2

{ "messages": [msg1, msg2, msg3, msg4] }

In this case we only need to process msg4 and return a new msg5.

@abetlen abetlen added enhancement New feature or request high-priority labels Apr 11, 2023
@abetlen abetlen pinned this issue Apr 11, 2023
@MillionthOdin16
Copy link
Contributor

Thanks. I think figuring out a way to get this working will jumpstart further integrations with other projects by making high-speed completions for chat more easily accessible.

I don't know the best way for keeping track of a chat session (i.e. matching messages to an existing context). I think I saw previously that you might have some user information, or maybe we could use the completion ID somehow?

And just to be clear, I'm not expecting this to scale to a significant number of conversations simultaneously. My idea is just that it gives us the ability to continue generation on an existing context without having to reinitialize it. So maybe a couple... Or even just one would be a great performance boost.

And once we do have state saving, or the ability to quickly load up significant message history, then this workaround would no longer be needed.

@MillionthOdin16
Copy link
Contributor

And I guess this could even be a separate endpoint that accepts the same data as the actual chat completions endpoint, but is implemented differently with a single persistent model and context that just adds the most recent message from the user to the message history and returns the generation. If that makes sense.

I didn't know if your initial message for this issue was the format you wanted to use or not.

@abetlen
Copy link
Owner Author

abetlen commented Apr 11, 2023

No that was just trying to describe how to check which messages to process. Basically if one list is a prefix of the other than just process the difference of the two lists, otherwise we need to start processing from scratch.

As for the API I'll keep the endpoint the same, unfortunately this workaround isn't gauranteed to work the same as re-processing from scratch but it's probably worth it to get this functionality in sooner rather than later.

@djaffer
Copy link

djaffer commented Apr 11, 2023

Does api accept chat history?

@abetlen
Copy link
Owner Author

abetlen commented Apr 11, 2023

@djaffer Yes if you check the docs it's identical to the OpenAI api where you send in the entire chat history https://platform.openai.com/docs/api-reference/chat

@abetlen
Copy link
Owner Author

abetlen commented Apr 11, 2023

image

Okay, I've got a basic implementation, just need to clean up some issues but generally it's working as expected (chats are significantly more responsive).

The way I'm implementing this is through a LlamaCache abstract class and then this specific version is a ContextCache who's basic strategy is to just check wether it's already processed the first part of a message and if so start from just the new tokens. This should also allow us to extend this with a KVStateCache in the future when I can work out the limitations with that API.

@MillionthOdin16 in the #17 issue could you give me a hand identifying the prompt formats for various models? For text completion this is user provided but for the chat it's a little bit more challenging (as illustrated above) because the finetuned models each expect a slightly different format.

@abetlen
Copy link
Owner Author

abetlen commented Apr 15, 2023

@MillionthOdin16 this is now pushed to main (not PyPI yet) if you get the chance can you test it out from source? All you need to do for the server is run with with CACHE=1 environment variable set.

@giga-Ryan
Copy link

How do I implement this? I'm completely new to this thing, and have zero coding knowledge beyond hello world. How can I put this to use in in termux on Android 13?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request high-priority
Projects
None yet
Development

No branches or pull requests

4 participants