Implement caching for evaluated prompts #44
I don't understand why we don't just use interactive mode. Almost all users coming from llama.cpp are used to an interface where they send a message and get a quick response, because there is no clearing of state between messages from the user, and so no need to reload the state. As I understood it, the KV cache is a way to store the prompt state, because the prompt is used multiple times over the course of a conversation, and it helps improve responsiveness during long conversations. Given the way it's used in the base llama.cpp executable, and the fact that in the current implementation of interactive mode storing the entire conversation state won't improve performance (it would only allow continuing a previous conversation in a different session), I don't know that this is something they're going to add in the immediate future.

For me, getting completions from the bot with the full context of the ongoing conversation is my main use case, so there's pretty much no situation where I would want the current conversation context cleared or reset. I thought this was similar for the OpenAI implementation, where you send the current message but don't need to send the full message history. Any recomputation or loading of model state decreases performance and makes it slower than the base llama.cpp implementation, imo. If people are using chat mode, from a user perspective they want a continuous and performant chat, even if that means running models and independent contexts simultaneously, which reduces scalability in the short term without the ability to load and save states.
@MillionthOdin16 are you talking about the OpenAI server or just using the Llama class? For the actual OpenAI API each request is entirely independent of all other requests (e.g. you always send the full history to the /chat/completions endpoint), so you do need to reset the model each time. This is why I'm looking into the KV state solution: we could just reload the state if we've seen, e.g., the first n-1 messages of an n-message chat get sent over. If you're just looking for interactive mode, though, I believe that's been implemented in this example, if you want to use it in a program and don't care about the API: https://github.com/abetlen/llama-cpp-python/blob/main/examples/low_level_api/low_level_api_chat_cpp.py
I'm talking about the OpenAI server. My point is that from the user perspective, the most important factor for chat completions is the speed of responses. Unfortunately, llama.cpp takes longer to process an initial prompt the longer it is. For the chat completions endpoint this creates an issue: the longer the conversation gets, the longer it takes to get a response. The reason we have this issue is that the llama.cpp implementation differs from the usual GPU implementations in how it processes the prompt before generating a response. So I'm saying that the most efficient solution in this case might be to not clear the context, and to save that processing time for each subsequent completion by keeping the session going. It diverges from how OpenAI implements it, but it's the only option we have right now, and chat completions isn't usable without it because it's too slow.
I'm basically advocating for a temporary hack: prevent the context from being cleared during chat completions so that we get a significant performance boost, until either we get a proper state-saving capability or the prompt-processing time issue is resolved. The situation is frustrating because we're so close to having an API that is performant and chat-capable, but there are just a couple of things holding it back, so I'm advocating for a temporary hack to allow good performance until we can implement it properly.
@MillionthOdin16 Is that still meaningfully the case now that the recent performance regressions appear to have been fixed?
So that's not what I mean in this case. I created issue 603 on llama.cpp, and now that we have that performance boost, it would be awesome to get as much of a boost in the API over here as we can. I meant the issue with the undetermined cause here: ggerganov/llama.cpp#719. I've seen people more familiar with LLMs mention some oddities about the initial processing in the past, but I haven't seen a straightforward explanation. As I understand it, llama.cpp has some differences in how it processes the initial prompt before generating tokens, and it's much slower than the transformers implementation (CPU vs. GPU aside). So I was just saying that if we get a performance boost from that avenue, as well as the ability to store a conversation's state, a proper implementation will be much faster than what we have at the moment, and we wouldn't need this workaround. Hope that makes sense. Right now there are so many small, different issues going on in the main repo that it's hard to keep track of, haha.
+1 to this. Many people are requesting this feature here: oobabooga/text-generation-webui#866. It would be nice to have an option for this.
@oobabooga I'm still working on the caching API, but for now I've added an option to keep the evaluated context from the previous call instead of resetting it.
@abetlen is it safe to use that option already?
@oobabooga Here is what that does. E.g. you feed in the prompt "My favorite season is", and the model replies "spring because everything is growing. I like to walk". Generation stops. You then feed "outside in the spring weather" (not the full prompt!) into generate, and to the model the full prompt is now "My favorite season is spring because everything is growing. I like to walk outside in the spring weather". I tested this in your own webui 😁 by setting the relevant option.

This is getting a bit off-topic, but to implement this in the webui I think the easiest way would be to save the prompt right before sending it to generate. Then, when the user calls generate again, you can compare the saved prompt with the user's prompt to see if they are merely appending to the existing prompt or editing it. If it is an append call, only the new suffix needs to be evaluated; a rough sketch of that check follows.
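A minimal sketch of the append-vs-edit check, assuming plain string prompts (the variable names and example strings are illustrative placeholders, not webui code):

```python
# `last_prompt` is whatever was sent on the previous generate call;
# both variables here are hypothetical placeholders.
last_prompt = "My favorite season is spring because everything is growing. I like to walk"
new_prompt = last_prompt + " outside in the spring weather"

if new_prompt.startswith(last_prompt):
    # Pure append: keep the already-evaluated context and only feed
    # the new suffix to generate.
    suffix = new_prompt[len(last_prompt):]
    reuse_context = True
else:
    # The prompt was edited somewhere earlier: the evaluated context is
    # stale, so reset and re-evaluate the whole prompt.
    suffix = new_prompt
    reuse_context = False

print(reuse_context, repr(suffix))
```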
Will this be exposed through the REST API at some point?
@oobabooga @eiery @gjmulder this is now pushed to main, just looking for someone to test. The process to set the cache from code is:

```python
llama = llama_cpp.Llama(...)
llama.set_cache(llama_cpp.LlamaCache())
```

then you can call the completion / chat completion methods as usual. If you're using the REST server it's enough to set the corresponding cache option in the server settings. If it works like you guys expect I'll publish to PyPI tonight or tomorrow.
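For anyone testing from Python, a rough end-to-end sketch of the intended flow might look like the following; the model path and prompts are placeholders, not from this thread:

```python
import llama_cpp

# Model path and prompts are placeholders; adjust for your own setup.
llama = llama_cpp.Llama(model_path="./models/7B/ggml-model-q4_0.bin")
llama.set_cache(llama_cpp.LlamaCache())

# First call: the prompt is evaluated and the resulting state is cached.
prompt = "Q: What is the capital of France? A:"
first = llama.create_completion(prompt, max_tokens=32)

# Second call: the new prompt extends the old prompt plus its output,
# so the shared prefix should be served from the cache instead of
# being re-evaluated.
followup = prompt + first["choices"][0]["text"] + " Q: And of Spain? A:"
second = llama.create_completion(followup, max_tokens=32)
print(second["choices"][0]["text"])
```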
@abetlen I have made a test where I generated 80 tokens, and then generated another 80 tokens on top of the result without modifying the prompt, comparing the timings with set_cache and without set_cache. I can see that there is a new check in place, but I did not measure any speedup in this test.
@gjmulder Correct. @oobabooga Whoops, for generate I implemented the check but didn't actually remove the old tokens from the list of tokens to eval. Should be fixed now.
@abetlen Caching is working well for me in your latest release 🎊. I'm running it using a modified oobabooga UI with the cache enabled.
@eiery Very glad to hear! Hopefully the llama_state API gets figured out in the base library soon, and then we're really talking: we could just restore the longest matching saved state from an LRU cache or something, along the lines of the sketch below.
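A rough sketch of what that could look like, assuming an opaque saved-state object per token prefix (the class and method names here are illustrative, not the library's API):

```python
from collections import OrderedDict

class PrefixStateCache:
    """LRU cache of opaque saved states keyed by the token prefix they cover."""

    def __init__(self, max_entries: int = 4):
        self.max_entries = max_entries
        self._states = OrderedDict()  # tuple(tokens) -> opaque saved state

    def put(self, tokens, state) -> None:
        key = tuple(tokens)
        self._states[key] = state
        self._states.move_to_end(key)
        if len(self._states) > self.max_entries:
            self._states.popitem(last=False)  # evict the least recently used entry

    def longest_match(self, tokens):
        """Return (prefix_len, state) for the longest cached prefix of `tokens`."""
        tokens = tuple(tokens)
        best_key, best_state = (), None
        for key, state in self._states.items():
            if len(key) > len(best_key) and tokens[: len(key)] == key:
                best_key, best_state = key, state
        if best_state is not None:
            self._states.move_to_end(best_key)  # mark the hit as recently used
        return len(best_key), best_state
```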
@gjmulder or anyone else able to test the server? It's been working on my end but I want an independent confirmation.
Having such a cache would be helpful indeed, especially if you do frequent editing. You could also afford to generate multiple replies with different parameters and let the user choose the one they like best. Honestly, if that's implemented, performance should be excellent until you hit the 2048-token limit and need to rotate the buffer or do tricks like summarization. I guess caching the initial prompt will help if it's a long one, but ingesting over a thousand tokens for every generation will tack on a couple of minutes every time. Luckily there are smart people at llama.cpp working on that...
@oobabooga @eiery Okay, I've pushed the 0.1.34 release to PyPI and the wheels should be building right now. This includes the new cache API. I'll keep this issue open to track proper cache support, and close #68.
I have made a new test with the new 0.1.34 release.
I grabbed this. Confirmed speeds are up when hitting the cache. Good times. Getting ~1 t/s on a 5950X with a 30B model compared to ~0.2 t/s before. No errors so far.

I will say that I'd somewhat expect clicking the continue button to always hit the cache, but that has not been the case. Not sure if it's a context-order issue (the context only being updated at the next send, rather than at the end of the generation) or a more naive comparison method (comparing the entire context buffer to the most recent context, where any mismatch forces a full regen), but I would expect a cache hit when clicking continue in the webui, assuming no edits to the existing context.

That could be non-trivial, but kobold.cpp's smartcontext implementation has helped there. It's a different use case (maintaining world/character data at the head of the context stack), obviously, but a chunk-based cache comparison could be valuable; see the sketch below. I will say, I don't know enough about how/whether context compounds, so maybe keeping later chunks unchanged would be a problem if you regenerate the first context without regenerating everything after it.
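A small sketch of the difference between a whole-buffer comparison and a shared-prefix comparison (the token values are made up for illustration):

```python
def common_prefix_len(old_tokens, new_tokens):
    """Number of leading tokens shared by the previous and the new context."""
    n = 0
    for a, b in zip(old_tokens, new_tokens):
        if a != b:
            break
        n += 1
    return n

old = [1, 15, 27, 42, 99, 100]
new = [1, 15, 27, 42, 7, 8, 9]  # made-up token ids: the tail changed

# Whole-buffer comparison: any mismatch forces a full re-evaluation.
naive_hit = new[: len(old)] == old            # False here

# Shared-prefix comparison: the first 4 evaluated tokens can be kept and
# evaluation can resume from position 4 onward.
reusable = common_prefix_len(old, new)        # 4
print(naive_hit, reusable)
```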
Can I confirm that the cache is only for the chat completions endpoint? I gave up on using the chat completions endpoint as it seemed to not understand roles and users when using Alpaca models. I'm now using the completions endpoint.
@gjmulder This should actually work for both APIs; is it not working for the completions endpoint?
@abetlen I might be being stupid here... how do I tell for certain that it is enabled? I'd need to generate the same text from the same prompt with and without caching.
@gjmulder For the completions endpoint you would just need to pass in a prompt that starts with the previously evaluated prompt (for example, the old prompt plus the text it generated); the cached prefix is then reused instead of being re-evaluated.
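For instance, a hedged sketch against the OpenAI-compatible completions endpoint (the host, port, and prompts are assumptions, not from this thread):

```python
import requests

BASE_URL = "http://localhost:8000"  # assumed host/port for a local llama-cpp-python server

prompt = "Q: Name the planets in the solar system. A:"
r1 = requests.post(f"{BASE_URL}/v1/completions",
                   json={"prompt": prompt, "max_tokens": 64})
text = r1.json()["choices"][0]["text"]

# The second request's prompt starts with the previously evaluated
# prompt plus its completion, so the cached prefix can be reused.
r2 = requests.post(f"{BASE_URL}/v1/completions",
                   json={"prompt": prompt + text + " Q: Which one is the largest? A:",
                         "max_tokens": 64})
print(r2.json()["choices"][0]["text"])
```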
The open-source models we currently work with are not good enough to produce clean output that never needs correction. Is there an option to keep the cache not just for the last generated prompt, but also (or instead) for the prompt one message earlier? This would let the user edit the last response and regenerate messages, in exchange for a minor latency increase. I have seen the idea of running two llama.cpp instances in parallel, where one is just used to store "state" in case it's needed, and they exchange states between each other following the user's actions.
ggerganov/llama.cpp#1105
The above was merged; we should be able to set the cache as needed.
@snxraven It's merged in the low-level API here too. Currently working on an implementation for the cache on top of it.
Thank you for the hard work! I am super excited for this!!!
There is much hope for success here. With only 1608.00 MB of RAM required for 13B models, the cost is quite affordable. If we can obtain a state from just one message prior, we would be able to freely regenerate outputs and make the necessary fixes. This approach would reduce the context awaiting evaluation to only the most recent output and the new prompt, which would add minimal delay given the speedy 25 ms/token processing facilitated by cuBLAS. It seems we are very close to a highly performant implementation for a very large context. This would be especially beneficial for characters: their context is written as a fixed personality and doesn't change, so not having to re-evaluate it each time is key to supporting large descriptions. We can go far beyond 2k tokens with this trick in the future.
One other observation: when you get close to the context window limit, older messages start getting cut. As a result, caching effectively stops working because the previous prompt never matches. Is there anything that could be done about that?
@snxraven and all, I've pushed the first version of the cache API. It uses the new llama.cpp state API but is currently limited to a single saved state per cache.
@abetlen I attempted this with my current codebase but I am only able to see cache misses. Has anything changed in the new code that would break my standard chat history? This is the current code which was running the temp cache system:
@snxraven I did fix a bug that caused unnecessary cache misses, but that should only be in releases after v0.1.37. I would also check the example in the docs.
@abetlen I'll give the docs test a try. My test with my codebase was on 0.1.38. I'll post my findings here later :)
I'm working through two issues at the moment with the caching. One is the best way to store the llama_state: for a 7B 4-bit quantized model with a context length of 2048 tokens, the llama_state size is ~1 GB. Currently I'm keeping the (single) llama_state object in memory for simplicity, but it's clear I'll probably need to move to a filesystem cache to handle larger cache sizes; a rough sketch of that idea is below. The other issue I'm trying to resolve is when to save the llama_state. Currently this is done after prompt processing, which makes sense for basic completions; however, for chats it means that the last returned message has to be re-processed each time.
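A minimal sketch of a filesystem-backed state cache, assuming the saved state is an opaque picklable object keyed by the prompt tokens (all names here are hypothetical, not the library's implementation):

```python
import hashlib
import pickle
from pathlib import Path

class DiskStateCache:
    """Hypothetical filesystem-backed cache for large llama_state objects.

    Each state is written to its own file, keyed by a hash of the prompt
    tokens it was saved after, so RAM only holds the state currently in use.
    """

    def __init__(self, cache_dir: str = "./llama_state_cache"):
        self.cache_dir = Path(cache_dir)
        self.cache_dir.mkdir(parents=True, exist_ok=True)

    def _path(self, tokens) -> Path:
        digest = hashlib.sha256(repr(tuple(tokens)).encode("utf-8")).hexdigest()
        return self.cache_dir / f"{digest}.state"

    def save(self, tokens, state) -> None:
        # `state` is treated as an opaque picklable object; how the real
        # llama_state would be serialized is an open question in this thread.
        with open(self._path(tokens), "wb") as f:
            pickle.dump(state, f)

    def load(self, tokens):
        path = self._path(tokens)
        if not path.exists():
            return None
        with open(path, "rb") as f:
            return pickle.load(f)
```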
This shouldn't be a problem. Swipes, regeneration, and editing of the last received message all rely on the state being preserved from before the latest message was generated. With the average output length almost never exceeding 200 tokens (~5 seconds), re-processing is a very cheap price to pay. Once we get a handle on these things I believe a user-preference toggle could be added, but for a first implementation this is a good compromise. The amount of control the user gains for the price of a little extra reprocessing is immeasurable. Moreover, GUIs that sit on top of the output, like SillyTavern, edit the LLM's output automatically without user input: they cut extra lines, unnecessary symbols, and so on. At the llama-cpp-python level there is literally no way to know what has been cut or changed, so if the state is saved after the output is done, it would never match.

If we get an option to write the cache to an SSD, then beyond reducing RAM requirements it would immediately give us the ability to load, at any time, a permanent cached state that has the description, dialog examples, and other information already processed. Not only does that let us freely work with the rest of the temporary context (summarize, truncate, edit), it also makes it possible to chat simultaneously with multiple personalities with only one model loaded. The power this brings is hard to overestimate; it's so much more than just freeing RAM.
I'm closing this issue in favour of #158; the current behaviour of the cache is discussed there.
The goal of this feature is to reduce latency for repeated calls to the chat_completion API by saving the kv_cache keyed by the prompt tokens.
The basic version of this is to simply save the kv_state after the prompt has been processed.
Additionally, we should investigate whether it's possible to save and restore the kv_state after the completion has been generated as well.
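As a conceptual sketch of that basic version (the state type and helper names are illustrative only, not the planned API):

```python
# Conceptual sketch only: a cache of kv states keyed by the prompt tokens
# that produced them.

kv_cache = {}

def on_prompt_processed(prompt_tokens, kv_state):
    # Basic version: store the state right after the prompt is processed.
    kv_cache[tuple(prompt_tokens)] = kv_state

def lookup(prompt_tokens):
    # A repeated call with the same prompt can restore the saved state
    # instead of re-evaluating the prompt from scratch.
    return kv_cache.get(tuple(prompt_tokens))
```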