Implement chat continuation #68
Thanks. I think figuring out a way to get this working will jumpstart further integrations with other projects by making high-speed completions for chat more easily accessible. I don't know the best way to keep track of a chat session (i.e. matching messages to an existing context). I think I saw previously that you might have some user information, or maybe we could use the completion ID somehow? And just to be clear, I'm not expecting this to scale to a significant number of simultaneous conversations. My idea is just that it gives us the ability to continue generation on an existing context without having to reinitialize it. So maybe a couple of contexts, or even just one, would be a great performance boost. And once we do have state saving, or the ability to quickly load up a significant message history, this workaround would no longer be needed.
And I guess this could even be a separate endpoint that accepts the same data as the actual chat completions endpoint, but is implemented differently: a single persistent model and context that just adds the most recent message from the user to the message history and returns the generation, if that makes sense. I didn't know whether the format in your initial message for this issue was the one you wanted to use or not.
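A minimal sketch of that "separate endpoint" idea, assuming FastAPI for the server layer and llama-cpp-python's `create_chat_completion` for generation; the route name, model path, and request shape here are illustrative, not the project's actual server code:

```python
# Sketch of the idea above: one persistent model and one stored conversation.
# FastAPI, the route, and the model path are assumptions for illustration.
from typing import Dict, List

from fastapi import FastAPI
from pydantic import BaseModel
from llama_cpp import Llama

app = FastAPI()
llm = Llama(model_path="./models/7B/ggml-model.bin")  # hypothetical path

history: List[Dict] = []  # the single persistent conversation


class ChatRequest(BaseModel):
    messages: List[Dict]  # same shape as the real chat completions endpoint


@app.post("/v1/chat/continue")  # hypothetical route name
def continue_chat(req: ChatRequest):
    # Append only the most recent user message; the earlier messages in the
    # request are assumed to match the history the server already holds.
    history.append(req.messages[-1])
    completion = llm.create_chat_completion(messages=history)
    history.append(completion["choices"][0]["message"])
    return completion
```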
No, that was just trying to describe how to check which messages to process. Basically, if one list is a prefix of the other, then just process the difference of the two lists; otherwise we need to start processing from scratch. As for the API, I'll keep the endpoint the same. Unfortunately this workaround isn't guaranteed to work the same as re-processing from scratch, but it's probably worth it to get this functionality in sooner rather than later.
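A minimal sketch of that prefix check, with hypothetical names:

```python
# Sketch of the prefix check described above; names are hypothetical.
def messages_to_process(cached, incoming):
    """If `cached` is a prefix of `incoming`, return only the new messages;
    otherwise everything has to be re-processed from scratch."""
    if len(incoming) >= len(cached) and incoming[: len(cached)] == cached:
        return incoming[len(cached):]  # just the difference of the two lists
    return incoming  # no shared prefix: start over
```

In practice the comparison would presumably happen on the tokenized prompt rather than the raw message objects, which is one likely reason the result isn't guaranteed to match re-processing from scratch exactly.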
Does the API accept chat history?
@djaffer Yes, if you check the docs it's identical to the OpenAI API, where you send in the entire chat history: https://platform.openai.com/docs/api-reference/chat
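For reference, a request carrying the full history might look like this; the base URL assumes a locally running llama-cpp-python server on its default port, and the messages are made up:

```python
# Sending the entire chat history on every request, OpenAI-style.
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed local server URL
    json={
        # Some deployments also require a "model" field; omitted here.
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "What is the capital of France?"},
            {"role": "assistant", "content": "Paris."},
            {"role": "user", "content": "And its population?"},  # only new msg
        ]
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```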
Okay, I've got a basic implementation; I just need to clean up some issues, but generally it's working as expected (chats are significantly more responsive). The way I'm implementing this is through a ...

@MillionthOdin16, from the #17 issue, could you give me a hand identifying the prompt formats for various models? For text completion this is user-provided, but for chat it's a little more challenging (as illustrated above) because the finetuned models each expect a slightly different format.
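To illustrate the kind of variation meant here, two commonly cited finetune templates; these are examples of the problem, not necessarily the formats the library settled on, and each model's expected template should be checked against its model card:

```python
# Illustrative only: two well-known finetune prompt templates, showing why
# a single hard-coded chat format doesn't fit every model.
def format_alpaca(instruction):
    # Alpaca-style instruction template
    return (
        "Below is an instruction that describes a task. "
        "Write a response that appropriately completes the request.\n\n"
        f"### Instruction:\n{instruction}\n\n### Response:\n"
    )


def format_vicuna(messages):
    # Vicuna-style USER/ASSISTANT turns
    out = []
    for m in messages:
        role = "USER" if m["role"] == "user" else "ASSISTANT"
        out.append(f"{role}: {m['content']}")
    out.append("ASSISTANT:")
    return "\n".join(out)
```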
@MillionthOdin16 this is now pushed to main (not on PyPI yet); if you get the chance, can you test it out from source? All you need to do for the server is run it with ...
How do I implement this? I'm completely new to this thing, and have zero coding knowledge beyond hello world. How can I put this to use in Termux on Android 13?
As suggested by @MillionthOdin16, because implementing #44 is taking longer than expected, we should add a simple form of chat continuation for when the previous message history matches, i.e.:

Request 1: [msg1, msg2]
Response 1: msg3
Request 2: [msg1, msg2, msg3, msg4]

In this case we only need to process msg4 and return a new msg5.
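Walking that example through a plain prefix check (message names as above, the slicing itself hypothetical):

```python
# Walking through the example above with a plain prefix check.
cached = ["msg1", "msg2", "msg3"]            # context after Request 1 + Response 1
incoming = ["msg1", "msg2", "msg3", "msg4"]  # Request 2
assert incoming[: len(cached)] == cached     # history matches the cached prefix
to_process = incoming[len(cached):]          # ["msg4"] -- only msg4 needs evaluation
```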